11  Foundations of Probability

11.1 Why Probability Matters

In games of chance, probability has a very intuitive definition. We know what it means that the chance of a pair of dice coming up seven is 1 in 6. However, probability is used much more broadly today, with the word commonly appearing in everyday language. Google’s auto-complete of “What are the chances of” gives us: “having twins”, “rain today”, “getting struck by lightning”, and “getting cancer”. One goal of this chapter is to show how probability helps us understand and describe real-world events when performing data analysis.

Because knowing how to compute probabilities gives you an edge in games of chance, throughout history many smart individuals—including famous mathematicians such as Cardano, Fermat, and Pascal—spent time and energy thinking through the math of these games. As a result, Probability Theory was born. Probability continues to be highly useful in modern games of chance. For example, in poker, we can compute the probability of winning a hand based on the cards on the table. Casinos rely on probability theory to develop games that almost certainly guarantee a profit.

Probability theory is useful in many other contexts and, in particular, in areas that depend on data affected by chance in some way. All of the other chapters in this part build upon probability theory. Knowledge of probability is therefore indispensable for data science.

11.2 Two Interpretations of Probability

There are two main ways to think about what probability means.

The frequentist interpretation views a probability as a mathematically convenient approximation to a long-run relative frequency. If we say the probability of heads when flipping a fair coin is 0.5, we mean that if we flipped the coin many, many times, about half the flips would come up heads. This interpretation grounds probability in observable, repeatable phenomena.

The subjective (Bayesian) interpretation views probability as a measure of belief or uncertainty. A probability statement expresses the opinion of some individual regarding how certain an event is to occur, given their current information. This interpretation allows us to assign probabilities to one-time events and to update beliefs as we gather evidence.

Both interpretations have their uses, and modern statistics draws on both perspectives. For now, the frequentist interpretation provides good intuition for the concepts we will develop.

11.3 Random Variables and Sample Spaces

A random variable is a quantity that can take on different values with different probabilities. The outcome of a coin flip, the number of bacterial colonies on a plate, and the expression level of a gene are all random variables.

The sample space of a random variable is the set of all possible values it can take. For a coin flip, the sample space is {Heads, Tails}. For a die roll, it is {1, 2, 3, 4, 5, 6}. For the concentration of a protein, it might be any non-negative real number.

A probability distribution describes how likely each value in the sample space is:

  • For discrete random variables (those that take distinct values), we use a probability mass function that gives the probability of each possible value
  • For continuous random variables (those that can take any value in a range), we use a probability density function from which probabilities are calculated by integration

One fundamental rule: the probabilities across the entire sample space must sum (or integrate) to 1. Something from the sample space must happen.
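
As a quick check of this rule, here is a minimal R sketch using two built-in distributions; the choice of a Binomial(10, 0.5) and the standard normal is arbitrary.

Code
# Discrete: a probability mass function sums to 1 over the sample space
sum(dbinom(0:10, size = 10, prob = 0.5))      # returns 1

# Continuous: a probability density function integrates to 1
integrate(dnorm, lower = -Inf, upper = Inf)   # returns 1 (up to numerical error)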

11.4 Discrete Probability

We start by covering some basic principles related to categorical data. This subset of probability is referred to as discrete probability. It will help us understand the probability theory we will later introduce for numeric and continuous data, which is much more common in data science applications.

Relative Frequency

A precise definition of probability can be given by noting all possible outcomes and counting how many satisfy the condition for our event. For example, if we have 2 red beads and 3 blue beads inside an urn and we pick one at random, what is the probability of picking a red one?

Our intuition tells us that the answer is 2/5 or 40%. There are five possible outcomes, of which two satisfy the condition necessary for the event “pick a red bead”. Since each of the five outcomes has the same chance of occurring, we conclude that the probability is 0.4 for red and 0.6 for blue.

A more tangible way to think about the probability of an event is as the proportion of times the event occurs when we repeat the experiment an infinite number of times, independently, and under the same conditions.

Notation

We use the notation \(\mbox{Pr}(A)\) to denote the probability of event \(A\) happening. We use the very general term event to refer to things that can happen when something occurs by chance. In our previous example, the event was “picking a red bead”. In a political poll in which we call 100 likely voters at random, an example of an event is “calling 48 Democrats and 52 Republicans”.

In data science applications, we will often deal with continuous variables. In those cases, events are often statements like “is this person taller than 6 feet”, which we write in a more mathematical form: \(X \geq 6\), where \(X\) denotes the height in feet of a randomly selected person.

Probability Distributions for Categorical Data

If we know the relative frequency of the different categories, defining a distribution for categorical outcomes is straightforward. We simply assign a probability to each category. In cases that can be thought of as beads in an urn, for each bead type, their proportion defines the distribution.

If we are randomly calling likely voters from a population that is 44% Democrat, 44% Republican, 10% undecided, and 2% Green Party, these proportions define the probability for each group:

Group        Probability
Republican   0.44
Democrat     0.44
Undecided    0.10
Green        0.02
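
One way to simulate draws from such a categorical distribution is the prob argument of R’s sample function; a minimal sketch:

Code
# Simulate calling 100 likely voters at random from this population
groups <- c("Republican", "Democrat", "Undecided", "Green")
p <- c(0.44, 0.44, 0.10, 0.02)
calls <- sample(groups, 100, replace = TRUE, prob = p)
prop.table(table(calls))  # observed proportions approximate the probabilities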

11.5 Distribution Moments and Parameters

Probability distributions can be characterized by their moments—metrics that describe the shape of the distribution. The first four moments correspond to important properties:

  1. Mean (\(\mu\)) - the center or expected value
  2. Variance (\(\sigma^2\)) - the spread or dispersion
  3. Skewness - the asymmetry of the distribution
  4. Kurtosis - the “tailedness” or peakedness

For a discrete random variable X, the expected value (mean) is:

\[E[X] = \sum_{\text{all } x} x \cdot P(X = x) = \mu\]

The variance measures dispersion around the mean:

\[\text{Var}(X) = E[(X - \mu)^2] = \sigma^2\]

These parameters are crucial because they describe real properties of the systems we study. In biology, for example, the mean height of a population tells us about the typical value, while the variance tells us about the diversity of heights among individuals.
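
As a concrete sketch, we can apply these two formulas to a fair six-sided die in R and compare with a simulation; the number of simulated rolls is arbitrary.

Code
# Expected value and variance of a fair die, from the definitions above
x <- 1:6
p <- rep(1/6, 6)
mu <- sum(x * p)               # E[X] = 3.5
sigma2 <- sum((x - mu)^2 * p)  # Var(X) = 35/12, about 2.92
c(mu, sigma2)

# A simulation of many rolls gives values close to these
rolls <- sample(x, 10000, replace = TRUE)
c(mean(rolls), var(rolls))     # var() is the sample variance, close to 35/12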

11.6 Monte Carlo Simulations

Computers provide a way to actually perform random experiments. Random number generators permit us to mimic the process of picking at random. An example is the sample function in R.

First, we use the function rep to generate the urn:

Code
beads <- rep(c("red", "blue"), times = c(2, 3))
beads
[1] "red"  "red"  "blue" "blue" "blue"

Then use sample to pick a bead at random:

Code
sample(beads, 1)
[1] "red"

This line of code produces one random outcome. We want to repeat this experiment a large enough number of times to make the results practically equivalent to repeating forever. This is an example of a Monte Carlo simulation.

To perform our first Monte Carlo simulation, we use the replicate function, which permits us to repeat the same task any number of times. Here, we repeat the random event B = 10,000 times:

Code
set.seed(1986)  # For reproducibility
B <- 10000
events <- replicate(B, sample(beads, 1))

We can now see if our definition actually agrees with this Monte Carlo simulation approximation:

Code
tab <- table(events)
prop.table(tab)
events
  blue    red 
0.6014 0.3986 

The numbers above are the estimated probabilities provided by this Monte Carlo simulation. Statistical theory tells us that as \(B\) gets larger, the estimates get closer to 3/5 = 0.6 and 2/5 = 0.4.

With and Without Replacement

The function sample has a size argument that permits us to pick more than one element from the urn. By default, this selection occurs without replacement: after a bead is selected, it is not returned to the urn.

Code
sample(beads, 5)
[1] "blue" "red"  "blue" "red"  "blue"

Because we draw all five beads without replacement, the result is always a rearrangement of the entire urn: three blue and two red beads in some order.

However, we can sample with replacement: return the bead back to the urn after selecting it:

Code
events <- sample(beads, B, replace = TRUE)
prop.table(table(events))
events
 blue   red 
0.601 0.399 

Not surprisingly, we get results very similar to those previously obtained with replicate.

11.7 The Bernoulli Distribution

The simplest probability distribution describes a single event with two possible outcomes—success or failure, yes or no, heads or tails. This is the Bernoulli distribution.

Consider flipping a fair coin once:

\[P(X = \text{Head}) = \frac{1}{2} = 0.5 = p\]

And the probability of tails is:

\[P(X = \text{Tail}) = \frac{1}{2} = 0.5 = 1 - p = q\]

If the coin is not fair, \(p\) might differ from 0.5, but the probabilities still sum to 1:

\[p + (1-p) = 1\]

This same framework applies to any binary outcome: whether a patient responds to treatment, whether an allele is inherited from a parent, or whether a product passes quality control.

Code
# Flip a coin 1000 times
set.seed(42)
flips <- rbinom(1000, 1, 0.5)

barplot(table(flips) / 1000,
        names.arg = c("Tails", "Heads"),
        ylab = "Probability",
        ylim = c(0, 0.75),
        col = "steelblue",
        main = "Estimated Bernoulli Distribution")
Figure 11.1: Simulated Bernoulli distribution from 1000 coin flips showing approximately equal probability for heads and tails

11.8 Probability Rules

Two fundamental rules govern how probabilities combine. Most probability distributions can be built up from these simple rules.

The AND Rule (Multiplication)

The probability that two independent events both occur is the product of their individual probabilities. If you flip a coin twice:

\[P(\text{First = Head AND Second = Head}) = p \times p = p^2\]

More generally, for independent events A and B:

\[P(A \text{ and } B) = P(A) \times P(B)\]

For a fair coin with \(p = 0.5\):

  • \(P(\text{HH}) = 0.5 \times 0.5 = 0.25\)
  • \(P(\text{HT}) = 0.5 \times 0.5 = 0.25\)
  • \(P(\text{TH}) = 0.5 \times 0.5 = 0.25\)
  • \(P(\text{TT}) = 0.5 \times 0.5 = 0.25\)

The OR Rule (Addition)

The probability that at least one of two events occurs is given by the addition rule:

\[P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)\]

For mutually exclusive events (those that cannot both occur), \(P(A \text{ and } B) = 0\), so the rule simplifies to:

\[P(A \text{ or } B) = P(A) + P(B)\]

The probability of getting exactly one head in two flips (either HT or TH):

\[P(\text{one head}) = P(\text{HT}) + P(\text{TH}) = 0.25 + 0.25 = 0.5\]
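
These two rules are easy to verify with a small simulation; a minimal sketch (the seed and number of repetitions are arbitrary):

Code
# Estimate P(exactly one head in two flips of a fair coin)
set.seed(2024)
B <- 10000
one_head <- replicate(B, {
  flips <- sample(c("H", "T"), 2, replace = TRUE)
  sum(flips == "H") == 1
})
mean(one_head)  # close to 0.5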

Multiplication Rule Under Independence

When events are independent, the multiplication rule simplifies:

\[P(A \text{ and } B \text{ and } C) = P(A) \times P(B) \times P(C)\]

But we must be careful: assuming independence when it does not actually hold can produce very different, and incorrect, probability calculations.

Independence Matters

Imagine a court case where the suspect was described as having a mustache and a beard. The defendant has both, and the prosecution brings in an “expert” who testifies that 1/10 men have beards and 1/5 have mustaches, concluding that only \(1/10 \times 1/5 = 0.02\) have both.

But to multiply like this we need to assume independence! If the conditional probability of a man having a mustache given that he has a beard is 0.95, then the correct probability is much higher: \(1/10 \times 0.95 = 0.095\).
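
The two calculations are simple enough to carry out directly in R; a minimal sketch of the arithmetic:

Code
# Incorrectly assuming independence
p_beard <- 1/10
p_mustache <- 1/5
p_beard * p_mustache               # 0.02

# Using the conditional probability of a mustache given a beard
p_mustache_given_beard <- 0.95
p_beard * p_mustache_given_beard   # 0.095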

11.9 Independence

We say two events are independent if the outcome of one does not affect the other. The classic example is coin tosses. Every time we toss a fair coin, the probability of seeing heads is 1/2 regardless of what previous tosses have revealed. The same is true when we pick beads from an urn with replacement.

Many examples of events that are not independent come from card games. When we deal the first card, the probability of it being a King is 1/13, since each of the thirteen ranks is equally likely (there are four Kings among the 52 cards). If the first card is a King and we do not return it to the deck, the probability that the second card is a King is only 3/51. These events are not independent: the first outcome affected the next one.

Code
# Demonstrate non-independence with sequential draws
set.seed(1)
x <- sample(beads, 5)
x[1:4]  # First four draws
[1] "red"  "blue" "blue" "blue"
Code
x[5]    # Knowing the first four draws, the fifth must be the remaining bead
[1] "red"

If you had to guess the color of the first bead, you would predict blue, since blue has a 60% chance. But the first four draws used up one red and all three blue beads, so the fifth bead must be the remaining red: its probability is now 1, not 40%. The events are not independent, so the probabilities change.

11.10 Conditional Probabilities

When events are not independent, conditional probabilities are useful. The conditional probability \(P(B|A)\) is the probability of B given that A has occurred:

\[P(\text{Card 2 is a King} \mid \text{Card 1 is a King}) = \frac{3}{51}\]

We use the symbol \(\mid\) as shorthand for “given that” or “conditional on”.

When two events A and B are independent:

\[P(A \mid B) = P(A)\]

This is the mathematical definition of independence: the fact that B happened does not affect the probability of A happening.

The general multiplication rule relates joint and conditional probability:

\[P(A \text{ and } B) = P(A) \times P(B \mid A)\]

This can be extended to more events:

\[P(A \text{ and } B \text{ and } C) = P(A) \times P(B \mid A) \times P(C \mid A \text{ and } B)\]
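
For instance, the probability that the first two cards dealt are both Kings follows from this rule; a minimal sketch, with a simulation to check the exact answer (the seed and number of simulations are arbitrary):

Code
# P(both Kings) = P(Card 1 is a King) * P(Card 2 is a King | Card 1 is a King)
(4/52) * (3/51)   # about 0.0045

# Check by simulating two-card deals from a deck of ranks only
set.seed(7)
ranks <- rep(c("King", "Other"), times = c(4, 48))
both_kings <- replicate(10000, all(sample(ranks, 2) == "King"))
mean(both_kings)  # close to the exact value above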

11.11 Bayes’ Theorem

Rearranging the multiplication rule yields Bayes’ theorem, a cornerstone of probabilistic reasoning:

\[P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}\]

Bayes’ theorem tells us how to update our beliefs about A after observing B. In Bayesian statistics, this is written as:

\[P(\theta|d) = \frac{P(d|\theta) \times P(\theta)}{P(d)}\]

where:

  • \(P(\theta|d)\) = posterior probability distribution
  • \(P(d|\theta)\) = likelihood function for \(\theta\)
  • \(P(\theta)\) = prior probability distribution
  • \(P(d)\) = marginal likelihood (normalizing constant)
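
As a small numeric sketch, we can apply the first form of the theorem to the beard-and-mustache example from earlier, reversing the direction of the conditioning:

Code
# P(beard | mustache) = P(mustache | beard) * P(beard) / P(mustache)
p_beard <- 1/10
p_mustache <- 1/5
p_mustache_given_beard <- 0.95
p_mustache_given_beard * p_beard / p_mustache   # 0.475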

11.12 Likelihood vs. Probability

A subtle but important distinction exists between probability and likelihood.

Probability is the chance of observing particular data given a model or parameter value. If we know a coin has \(p = 0.5\), what is the probability of observing 7 heads in 10 flips?

Likelihood is how well a parameter value explains observed data. Given that we observed 7 heads in 10 flips, how likely is it that the true probability is \(p = 0.5\) versus \(p = 0.7\)?

Mathematically, the likelihood function uses the same formula as probability, but we think of it differently:

\[L(\text{parameter} | \text{data}) = P(\text{data} | \text{parameter})\]

Maximum likelihood estimation finds the parameter value that makes the observed data most probable—the value that maximizes the likelihood function.
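
A minimal R sketch of this distinction, using the 7-heads-in-10-flips example (the grid of candidate p values is an arbitrary choice):

Code
# Probability: chance of 7 heads in 10 flips when p is known to be 0.5
dbinom(7, size = 10, prob = 0.5)   # about 0.117

# Likelihood: how well different values of p explain the observed 7 heads
dbinom(7, size = 10, prob = 0.7)   # about 0.267, higher than for p = 0.5

# Maximum likelihood: evaluate the likelihood over a grid of candidate p values
p_grid <- seq(0, 1, by = 0.01)
likelihood <- dbinom(7, size = 10, prob = p_grid)
p_grid[which.max(likelihood)]      # 0.7, the maximum likelihood estimate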

11.13 Combinations and Permutations

For more complicated probability calculations, we need to count possibilities systematically. The gtools package provides useful functions.

Permutations (Order Matters)

A permutation is an arrangement where order matters. For any list of size n, the permutations function computes all different arrangements when selecting r items:

Code
library(gtools)  # provides permutations() and combinations()

# All ways to arrange 2 items from {1, 2, 3}
permutations(3, 2)
     [,1] [,2]
[1,]    1    2
[2,]    1    3
[3,]    2    1
[4,]    2    3
[5,]    3    1
[6,]    3    2

Notice that order matters: (3,1) is different from (1,3). Also, (1,1), (2,2), and (3,3) do not appear because once we pick a number, it can’t appear again.

Combinations (Order Doesn’t Matter)

A combination is a selection where order doesn’t matter:

Code
combinations(3, 2)
     [,1] [,2]
[1,]    1    2
[2,]    1    3
[3,]    2    3

The outcome (2,1) doesn’t appear because (1,2) already represents the same combination.

Example: Blackjack

Let’s compute the probability of getting a “Natural 21” in Blackjack: an Ace together with a ten-value card (a ten or a face card) in the first two cards:

Code
# Build a deck
suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven",
             "Eight", "Nine", "Ten", "Jack", "Queen", "King")
deck <- expand.grid(number = numbers, suit = suits)
deck <- paste(deck$number, deck$suit)

# Define aces and ten-value cards (tens and face cards)
aces <- paste("Ace", suits)
facecard <- c("King", "Queen", "Jack", "Ten")
facecard <- expand.grid(number = facecard, suit = suits)
facecard <- paste(facecard$number, facecard$suit)

# All possible two-card hands (order doesn't matter)
hands <- combinations(52, 2, v = deck)

# Probability of Natural 21
mean((hands[,1] %in% aces & hands[,2] %in% facecard) |
     (hands[,2] %in% aces & hands[,1] %in% facecard))
[1] 0.04826546

11.14 Classic Examples

The Monty Hall Problem

In the game show “Let’s Make a Deal,” contestants pick one of three doors. Behind one door is a car; behind the others are goats. After you pick a door, Monty Hall opens one of the remaining doors to reveal a goat. Then he asks: “Do you want to switch doors?”

Intuition suggests it shouldn’t matter—you’re choosing between two doors, so shouldn’t the probability be 50-50? Let’s use Monte Carlo simulation:

Code
B <- 10000
monty_hall <- function(strategy) {
  doors <- as.character(1:3)
  prize <- sample(c("car", "goat", "goat"))
  prize_door <- doors[prize == "car"]
  my_pick <- sample(doors, 1)
  show <- sample(doors[!doors %in% c(my_pick, prize_door)], 1)

  if (strategy == "stick") {
    choice <- my_pick
  } else {
    choice <- doors[!doors %in% c(my_pick, show)]
  }
  choice == prize_door
}

stick_wins <- replicate(B, monty_hall("stick"))
switch_wins <- replicate(B, monty_hall("switch"))

cat("Probability of winning when sticking:", mean(stick_wins), "\n")
Probability of winning when sticking: 0.3365 
Code
cat("Probability of winning when switching:", mean(switch_wins), "\n")
Probability of winning when switching: 0.6643 

Switching doubles your chances! The key insight: when you first pick, you have a 1/3 chance of being right. Monty’s reveal doesn’t change that. Since the probability the car is behind one of the other doors was 2/3, and Monty showed you which one doesn’t have it, switching gives you that 2/3 probability.

The Birthday Problem

In a room with 50 people, what’s the probability that at least two share a birthday?

Code
# Monte Carlo simulation
B <- 10000
same_birthday <- function(n) {
  bdays <- sample(1:365, n, replace = TRUE)
  any(duplicated(bdays))
}

results <- replicate(B, same_birthday(50))
cat("Probability with 50 people:", mean(results), "\n")
Probability with 50 people: 0.9701 
Code
# How does this change with group size?
compute_prob <- function(n, B = 10000) {
  results <- replicate(B, same_birthday(n))
  mean(results)
}

n <- seq(1, 60)
prob <- sapply(n, compute_prob)

library(ggplot2)
ggplot(data.frame(n, prob), aes(n, prob)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
  labs(x = "Group Size", y = "Probability of Shared Birthday",
       title = "The Birthday Problem")
Figure 11.2: The probability of at least two people sharing a birthday increases rapidly with group size, exceeding 50% at just 23 people

With just 23 people, there’s already a 50% chance of a shared birthday! People tend to underestimate these probabilities because they think about the probability that someone shares their birthday, not the probability that any two people share a birthday.

We can also compute this exactly using the multiplication rule:

Code
# Exact probability of at least one shared birthday among n people
exact_prob <- function(n) {
  prob_unique <- seq(365, 365 - n + 1) / 365  # per-person factors for all-unique birthdays
  1 - prod(prob_unique)                       # complement: at least one match
}

eprob <- sapply(n, exact_prob)  # exact values across group sizes, for comparison
cat("Exact probability with 50 people:", exact_prob(50), "\n")
Exact probability with 50 people: 0.9703736 

11.15 Covariance and Correlation

When two variables are not independent, they share information—knowing one tells you something about the other. This shared information is quantified by covariance, a measure of how two variables vary together.

If high values of X tend to occur with high values of Y (and low with low), the covariance is positive. If high values of X tend to occur with low values of Y, the covariance is negative. If there is no relationship, the covariance is near zero.

Correlation is covariance standardized to fall between -1 and 1, making it easier to interpret. A correlation of 1 means perfect positive linear relationship; -1 means perfect negative linear relationship; 0 means no linear relationship.
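
A minimal sketch of both measures on simulated data (the relationship, seed, and sample size are arbitrary choices for illustration):

Code
# Simulate two related variables: y depends on x plus independent noise
set.seed(3)
x <- rnorm(1000)
y <- 2 * x + rnorm(1000)

cov(x, y)   # positive: x and y tend to vary together
cor(x, y)   # around 0.9: a strong positive linear relationship

# An unrelated variable has correlation near zero
z <- rnorm(1000)
cor(x, z)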

These concepts will become central when we discuss regression and other methods for relating variables to each other.

11.16 How Large is “Large Enough” for Monte Carlo?

The theory described here requires repeating experiments over and over forever. In practice, we can’t do this. In the examples above, we used \(B = 10,000\) Monte Carlo experiments and it turned out to provide accurate estimates.

One practical approach is to check for the stability of the estimate:

Code
B_values <- 10^seq(1, 5, len = 100)
compute_prob_B <- function(B, n = 25) {
  same_day <- replicate(B, same_birthday(n))
  mean(same_day)
}

prob <- sapply(B_values, compute_prob_B)
ggplot(data.frame(logB = log10(B_values), prob), aes(logB, prob)) +
  geom_line() +
  geom_hline(yintercept = exact_prob(25), color = "red", linetype = "dashed") +
  labs(x = "log10(Number of Simulations)", y = "Estimated Probability",
       title = "Monte Carlo Convergence")
Figure 11.3: Monte Carlo estimates converge to the true probability (red dashed line) as the number of simulations increases

The values start to stabilize (vary less than 0.01) around 1000 simulations. The exact probability is shown in red.

11.17 Summary

This chapter introduced the language of probability:

  • Random variables can take different values with different probabilities
  • The sample space contains all possible outcomes
  • Probability distributions describe how likely each outcome is
  • The AND rule (multiply) and OR rule (add) combine probabilities
  • Independence means one event doesn’t affect another’s probability
  • Conditional probability describes probability given other information
  • Bayes’ theorem updates beliefs based on new evidence
  • Monte Carlo simulations estimate probabilities through repeated random sampling
  • Classic problems like Monty Hall and birthdays reveal counterintuitive probability results

Understanding these foundations is essential for all statistical inference that follows.

11.18 Practice Exercises

Exercise P.1: Simulating Coin Flips

Use R to simulate coin flips:

  1. Simulate 100 fair coin flips using rbinom() or sample()
  2. Calculate the proportion of heads
  3. Repeat with 1000 and 10000 flips—how does the proportion change?
  4. Create a histogram of results from many simulations

Code
# Simulate coin flips
set.seed(123)
n_flips <- 1000
flips <- rbinom(n_flips, size = 1, prob = 0.5)
mean(flips)  # Proportion of heads (1s)

# Or using sample
flips <- sample(c("H", "T"), n_flips, replace = TRUE)
mean(flips == "H")
Exercise P.2: Binomial Distribution

Explore the binomial distribution:

  1. Use rbinom() to simulate 1000 experiments, each with 20 coin flips
  2. Create a histogram of the number of heads per experiment
  3. What is the most common outcome? Does this match your expectation?
  4. Change the probability to simulate an unfair coin
  5. How does the distribution change with 200 or 2000 flips per experiment?
Code
# 1000 experiments, 20 flips each, fair coin
set.seed(42)
results <- rbinom(1000, size = 20, prob = 0.5)
hist(results, breaks = 20, col = "steelblue",
     main = "Number of Heads in 20 Flips",
     xlab = "Number of Heads")
Figure 11.4: Distribution of heads from 1000 simulated experiments of 20 coin flips each
Exercise P.3: The Birthday Problem

Use Monte Carlo simulation to explore the birthday problem:

  1. Write a function that simulates whether any two people in a group share a birthday
  2. Estimate the probability for groups of size 10, 23, and 50
  3. Plot the probability as a function of group size
  4. At what group size does the probability exceed 50%?
Code
# Birthday simulation function
same_birthday <- function(n, B = 10000) {
  matches <- replicate(B, {
    birthdays <- sample(1:365, n, replace = TRUE)
    any(duplicated(birthdays))
  })
  mean(matches)
}

# Test for different group sizes
sizes <- 2:50
probs <- sapply(sizes, same_birthday)
plot(sizes, probs, type = "l",
     xlab = "Group Size", ylab = "Probability of Shared Birthday")
abline(h = 0.5, col = "red", lty = 2)
Exercise P.4: Conditional Probability

Explore conditional probability with card simulations:

  1. Create a virtual deck of 52 cards
  2. Calculate the probability of drawing a King
  3. Given that the first card drawn is a King, what is the probability the second card is also a King?
  4. Use simulation to verify your calculation
Exercise P.5: The Monty Hall Problem

Simulate the Monty Hall problem:

  1. Write a function that simulates one round of the game
  2. Compare the win rate when you stick versus when you switch
  3. Run 10,000 simulations for each strategy
  4. Does switching really double your chances?

11.19 Additional Resources

  • Irizarry (2019) - A gitbook by a statistician with excellent introductions to key topics in statistical inference
  • Logan (2010) - A comprehensive introduction to R for statistical analysis
  • For a detailed reference of common probability distributions, see Chapter 50