Lecture 3 — Conjugate Models & Choosing Your Prior (Ch. 5)

Author

Johnny van Doorn

Published

June 8, 2026

By the end of this lecture, you will be able to:

Explain what conjugacy means and why it is useful
Apply the Beta–Binomial, Gamma–Poisson, and Normal–Normal update rules
Choose a prior using effective sample size, moment matching, or a prior predictive check
Run a prior sensitivity analysis in R

Recommended Reading: Bayes Rules! Chapter 5. Suggested exercises (worked solutions available): 5.1, 5.2, 5.3, 5.7, 5.8, 5.9, 5.10.

Conjugate Models (Ch. 5)

What Is Conjugacy?

\[f(\theta \mid y) \;\propto\; f(\theta) \cdot L(y \mid \theta)\]

A conjugate prior is one where the posterior belongs to the same distributional family as the prior — the hyperparameters simply absorb the data.

Advantage	Why it matters
Closed-form posterior	No numerical integration needed
Transparent update	Hyperparameters = prior pseudo-data
Easy sensitivity analysis	Just change hyperparameters
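
A quick numerical check can make the "closed-form posterior" row concrete. The sketch below normalises prior × likelihood on a grid and compares the result with the Beta posterior that conjugacy predicts (the Beta–Binomial rule from the next section). The Beta(2, 2) prior and the data (3 successes in 10 trials) are hypothetical values chosen only for illustration.

```r
# Hypothetical illustration values (not from the lecture)
a <- 2; b <- 2      # Beta(2, 2) prior
y <- 3; n <- 10     # 3 successes in 10 trials

# Grid approximation: prior x likelihood, normalised by a Riemann sum
step   <- 0.001
theta  <- seq(step / 2, 1 - step / 2, by = step)   # bin midpoints on (0, 1)
unnorm <- dbeta(theta, a, b) * dbinom(y, n, theta)
post_grid <- unnorm / sum(unnorm * step)

# Closed-form conjugate posterior: Beta(a + y, b + n - y)
post_exact <- dbeta(theta, a + y, b + n - y)

max_abs_diff <- max(abs(post_grid - post_exact))
cat("Largest gap between grid and closed form:", signif(max_abs_diff, 2), "\n")
```

The two curves should agree up to the discretisation error of the grid, which is the practical meaning of "no numerical integration needed".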

The Beta–Binomial Model

Research Context

A digital CBT app is evaluated. What proportion \(\theta\) of patients show meaningful improvement?

\(\theta \in [0,1]\): probability of improvement
\(Y \mid \theta \sim \text{Binomial}(n, \theta)\)
Prior: \(\theta \sim \text{Beta}(\alpha, \beta)\)

Prior literature suggests ~60% success. A pilot of 20 gives: \(\alpha = 12\), \(\beta = 8\) (ESS = 20, mean = 0.60).

The Update Rule

\[\theta \sim \text{Beta}(\alpha, \beta) \qquad Y \mid \theta \sim \text{Bin}(n, \theta)\]

\[\boxed{\theta \mid y \;\sim\; \text{Beta}(\alpha + y,\;\; \beta + n - y)}\]

	Prior	Posterior
“Successes”	\(\alpha\)	\(\alpha + y\)
“Failures”	\(\beta\)	\(\beta + n - y\)
Mean	\(\alpha / (\alpha+\beta)\)	\((\alpha+y) / (\alpha+\beta+n)\)
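
The update rule follows from multiplying the Beta prior kernel by the Binomial likelihood and dropping every factor that does not involve \(\theta\). A short derivation, using only the symbols defined above:

```latex
\[
\begin{aligned}
f(\theta \mid y) &\propto f(\theta) \cdot L(y \mid \theta) \\
&\propto \theta^{\alpha-1}(1-\theta)^{\beta-1} \cdot \theta^{y}(1-\theta)^{n-y} \\
&= \theta^{(\alpha+y)-1}\,(1-\theta)^{(\beta+n-y)-1}
\end{aligned}
\]
```

The last line is the kernel of a Beta\((\alpha + y,\; \beta + n - y)\) density, so the posterior stays in the Beta family.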

R Code

library(bayesrules)

# Prior Beta(12, 8),  data: y = 14 successes out of n = 20
plot_beta_binomial(alpha = 12, beta = 8, y = 14, n = 20)

summarize_beta_binomial(alpha = 12, beta = 8, y = 14, n = 20)

# Posterior is Beta(12 + 14, 8 + 20 - 14) = Beta(26, 14)
# Posterior probability: P(theta > 0.70 | y)
pbeta(0.70, shape1 = 26, shape2 = 14, lower.tail = FALSE)

[1] 0.2602812
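
The posterior summaries can also be reproduced in base R, which helps when bayesrules is not installed or when you want to see where each number comes from. This is a minimal sketch using the standard Beta mean, mode, and SD formulas; the variable names are illustrative.

```r
# Posterior for the CBT example: Beta(12 + 14, 8 + 20 - 14) = Beta(26, 14)
alpha_post <- 12 + 14
beta_post  <- 8 + (20 - 14)
total      <- alpha_post + beta_post

post_mean <- alpha_post / total
post_mode <- (alpha_post - 1) / (total - 2)            # valid when both shapes > 1
post_sd   <- sqrt(alpha_post * beta_post / (total^2 * (total + 1)))
ci_95     <- qbeta(c(0.025, 0.975), alpha_post, beta_post)  # equal-tailed interval

cat("Mean:", round(post_mean, 3),
    "| Mode:", round(post_mode, 3),
    "| SD:", round(post_sd, 3), "\n")
cat("95% credible interval:", round(ci_95, 3), "\n")
```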

The Gamma–Poisson Model

Research Context

A patient with OCD records intrusive thoughts each day for 7 days. What is the daily rate \(\lambda\)?

\(\lambda > 0\): mean daily count
\(Y_i \mid \lambda \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)\)
Prior: \(\lambda \sim \text{Gamma}(s, r)\)

Literature suggests ~5 thoughts/day → \(s = 5\), \(r = 1\) (mean = 5, ESS \(\approx\) 1 observation period).

The Update Rule

\[\lambda \sim \text{Gamma}(s, r) \qquad Y_i \mid \lambda \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)\]

\[\boxed{\lambda \mid \mathbf{y} \;\sim\; \text{Gamma}\!\left(s + \sum y_i,\;\; r + n\right)}\]

	Shape	Rate	Mean
Prior	\(s\)	\(r\)	\(s/r\)
Posterior	\(s + \sum y_i\)	\(r + n\)	\((s + \sum y_i)/(r+n)\)
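
The same kernel argument gives the Gamma–Poisson rule. Multiplying the Gamma prior by the \(n\) Poisson likelihood terms and dropping constants that do not involve \(\lambda\):

```latex
\[
\begin{aligned}
f(\lambda \mid \mathbf{y}) &\propto f(\lambda) \cdot \prod_{i=1}^{n} L(y_i \mid \lambda) \\
&\propto \lambda^{s-1} e^{-r\lambda} \cdot \prod_{i=1}^{n} \lambda^{y_i} e^{-\lambda} \\
&= \lambda^{(s + \sum y_i) - 1}\, e^{-(r+n)\lambda}
\end{aligned}
\]
```

This is the kernel of a Gamma\((s + \sum y_i,\; r + n)\) density: the shape absorbs the total count and the rate absorbs the number of observation periods.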

R Code

library(bayesrules)

# Prior Gamma(5, 1),  data: 7 daily counts
y_vec <- c(6, 2, 2, 1, 3, 5, 3)

plot_gamma_poisson(shape = 5, rate = 1,
                   sum_y = sum(y_vec), n = length(y_vec))

summarize_gamma_poisson(shape = 5, rate = 1,
                        sum_y = sum(y_vec), n = length(y_vec))
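
As a cross-check on the bayesrules output, the update can be done by hand in base R. The sketch below also shows that the posterior mean is a weighted average of the prior mean and the sample mean, with prior weight \(r/(r+n)\); variable names are illustrative.

```r
# Prior Gamma(shape = 5, rate = 1) and the 7 daily counts from the lecture
s <- 5; r <- 1
y_vec <- c(6, 2, 2, 1, 3, 5, 3)
n <- length(y_vec)

# Conjugate update: shape gains the total count, rate gains the number of days
shape_post <- s + sum(y_vec)   # 5 + 22 = 27
rate_post  <- r + n            # 1 + 7  = 8

post_mean <- shape_post / rate_post
post_sd   <- sqrt(shape_post) / rate_post
ci_95     <- qgamma(c(0.025, 0.975), shape = shape_post, rate = rate_post)

# Posterior mean as a weighted average of prior mean and sample mean
w_prior    <- r / (r + n)
mean_check <- w_prior * (s / r) + (1 - w_prior) * mean(y_vec)

cat("Posterior: Gamma(", shape_post, ",", rate_post, ")\n")
cat("Mean:", round(post_mean, 3), "| SD:", round(post_sd, 3), "\n")
cat("95% credible interval:", round(ci_95, 2), "\n")
```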

The Normal–Normal Model

Research Context

Stroop task. Literature: mean RT ~650 ms, known measurement error \(\sigma = 50\) ms. You collect \(n = 5\) participants.

\(\mu\): true mean RT
\(Y_i \mid \mu \overset{\text{iid}}{\sim} N(\mu, \sigma^2)\), \(\sigma\) known
Prior: \(\mu \sim N(650, 100^2)\)

The Update Rule

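With prior \(\mu \sim N(\mu_0, \tau^2)\), the posterior is again Normal. Its mean is a precision-weighted average of the prior mean and the sample mean:

\[\mu_\text{post} = w_\text{prior} \cdot \mu_0 + w_\text{data} \cdot \bar{y}\]

where \(w_\text{prior} = \frac{\sigma^2}{n\tau^2 + \sigma^2}\) and \(w_\text{data} = \frac{n\tau^2}{n\tau^2 + \sigma^2}\). The posterior variance is \(\sigma^2_\text{post} = \frac{\sigma^2 \tau^2}{n\tau^2 + \sigma^2}\), which is smaller than both the prior variance \(\tau^2\) and the sampling variance of the mean \(\sigma^2 / n\).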

As \(n\) grows, \(w_\text{data} \to 1\) and the posterior concentrates on \(\bar{y}\).

Notation convention. Throughout this lecture and Formula 5.12 in the book, \(\mathcal{N}(\cdot, \cdot)\) uses (mean, variance). The textbook occasionally writes the second slot as SD instead — when it does, square it before plugging into the update formulas. (R is the opposite: dnorm and the bayesrules functions take sd =, not variance.)

R Code

library(bayesrules)

y_vec <- c(620, 590, 680, 640, 610)

plot_normal_normal(mean = 650, sd = 100,
                   sigma = 50, y_bar = mean(y_vec), n = 5)

summarize_normal_normal(mean = 650, sd = 100,
                        sigma = 50, y_bar = mean(y_vec), n = 5)
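
The weights in the update rule can be evaluated directly for this example. The sketch below uses base R only and the lecture's values (\(\mu_0 = 650\), \(\tau = 100\), \(\sigma = 50\)); the posterior variance is computed from the standard precision form \(1/(n/\sigma^2 + 1/\tau^2)\).

```r
# Stroop example: prior N(650, 100^2), known sigma = 50, n = 5 participants
mu0 <- 650; tau <- 100; sigma <- 50
y_vec <- c(620, 590, 680, 640, 610)
n     <- length(y_vec)
y_bar <- mean(y_vec)

# Precision weights from the update rule
w_prior <- sigma^2 / (n * tau^2 + sigma^2)
w_data  <- n * tau^2 / (n * tau^2 + sigma^2)

mu_post <- w_prior * mu0 + w_data * y_bar
sd_post <- sqrt(1 / (n / sigma^2 + 1 / tau^2))   # posterior SD (precision form)

cat("Sample mean:", y_bar, "\n")
cat("w_prior:", round(w_prior, 3), "| w_data:", round(w_data, 3), "\n")
cat("Posterior mean:", round(mu_post, 1), "| Posterior SD:", round(sd_post, 1), "\n")
```

With a prior this wide the data carry about 95% of the weight, so the posterior mean lands close to the sample mean.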

Choosing Your Prior

Three Strategies

Strategy 1: Effective Sample Size (ESS)

Read hyperparameters as pseudo-data. ESS tells you how much prior information you’re asserting.

Beta(\(\alpha, \beta\)): ESS \(= \alpha + \beta\). Beta(12, 8) acts like 20 prior observations.
Gamma(\(s, r\)): ESS \(\approx r\) observation periods.
Normal(\(\mu_0, \tau^2\)): ESS \(\approx \sigma^2 / \tau^2\) observations.

Rule of thumb: keep ESS small relative to \(n\) unless you have substantial prior evidence.
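
To see what "small relative to \(n\)" means for the three running examples, the ESS can be set against each sample size. In this sketch the helper names (`ess_beta`, `ess_gamma`, `ess_normal`, `prior_weight`) are hypothetical; `prior_weight` is the share ESS / (ESS + n) that the prior mean receives in the posterior mean, which follows from the update rules above.

```r
# ESS of each conjugate prior, read as pseudo-observations
ess_beta   <- function(alpha, beta) alpha + beta
ess_gamma  <- function(shape, rate) rate
ess_normal <- function(sigma, tau)  sigma^2 / tau^2

# Share of the posterior mean contributed by the prior mean
prior_weight <- function(ess, n) ess / (ess + n)

# Priors and sample sizes from the three lecture examples
ess   <- c(beta   = ess_beta(12, 8),
           gamma  = ess_gamma(5, 1),
           normal = ess_normal(sigma = 50, tau = 100))
n_obs <- c(beta = 20, gamma = 7, normal = 5)

print(round(rbind(ESS = ess, n = n_obs,
                  prior_weight = prior_weight(ess, n_obs)), 3))
```

The Beta(12, 8) prior counts as much as the 20 observed patients, whereas the Gamma and Normal priors are light relative to their data.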

Strategy 2: Moment Matching

Fix a target mean \(\mu_0\) and SD \(\sigma_0\), then solve for hyperparameters.

Beta: \[\text{ESS} = \frac{\mu_0(1-\mu_0)}{\sigma_0^2} - 1, \quad \alpha = \mu_0 \cdot \text{ESS}, \quad \beta = (1-\mu_0) \cdot \text{ESS}\]

Gamma: \[s = \frac{\mu_0^2}{\sigma_0^2}, \quad r = \frac{\mu_0}{\sigma_0^2}\]

Example: Expect ~70% success, SD = 0.10 → ESS = 20, \(\alpha = 14\), \(\beta = 6\) → Beta(14, 6).
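
The moment-matching formulas translate directly into small helper functions. This is a minimal sketch; the function names are hypothetical, and the Gamma target (mean 5, SD \(\sqrt{5}\)) is chosen only because it recovers the Gamma(5, 1) prior used earlier.

```r
# Beta hyperparameters from a target mean and SD
beta_from_moments <- function(mu0, sd0) {
  ess <- mu0 * (1 - mu0) / sd0^2 - 1
  if (ess <= 0) stop("sd0 is too large for a Beta distribution with this mean")
  c(alpha = mu0 * ess, beta = (1 - mu0) * ess)
}

# Gamma hyperparameters (shape s, rate r) from a target mean and SD
gamma_from_moments <- function(mu0, sd0) {
  c(shape = mu0^2 / sd0^2, rate = mu0 / sd0^2)
}

beta_prior  <- beta_from_moments(mu0 = 0.70, sd0 = 0.10)   # lecture example
gamma_prior <- gamma_from_moments(mu0 = 5, sd0 = sqrt(5))  # illustrative target

print(round(beta_prior, 2))
print(round(gamma_prior, 2))
```

The `stop()` guard matters in practice: a target SD that is too large for the chosen mean gives a non-positive ESS, which means no Beta distribution has those moments.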

Strategy 3: Prior Predictive Check

Simulate what the prior predicts before seeing data. If the simulated data look implausible, revise the prior.

Draw \(\theta^{(i)}\) from the prior
Simulate \(y^{(i)}\) from Binomial\((n, \theta^{(i)})\) for each draw \(\theta^{(i)}\)
Plot: does this look like realistic study data?

Prior Predictive Check in R

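# Prior Beta(12, 8), planning n = 20
set.seed(1)  # arbitrary seed so the simulation is reproducible
theta_prior  <- rbeta(1e4, shape1 = 12, shape2 = 8)
y_prior_pred <- rbinom(1e4, size = 20, prob = theta_prior)

# Half-integer breaks give each count 0, 1, ..., 20 its own bar
# (breaks = 0:20 would merge the counts 0 and 1 into one bin)
hist(y_prior_pred, breaks = seq(-0.5, 20.5, by = 1),
     main = "Prior Predictive Distribution (Beta(12,8), n=20)",
     xlab = "Number of successes (out of 20)",
     col = "steelblue", border = "white")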
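
For a Beta prior with a Binomial likelihood the prior predictive distribution is also available exactly (it is the Beta-Binomial distribution), so the simulated histogram can be compared against it. The sketch below computes the exact probabilities in base R on the log scale; variable names are illustrative.

```r
# Exact prior predictive for Beta(12, 8) and n = 20:
# P(Y = y) = choose(n, y) * B(a + y, b + n - y) / B(a, b)
a <- 12; b <- 8; n <- 20
y_vals <- 0:n

log_pmf <- lchoose(n, y_vals) + lbeta(a + y_vals, b + n - y_vals) - lbeta(a, b)
pmf     <- exp(log_pmf)

pred_mean <- sum(y_vals * pmf)            # equals n * a / (a + b)
cdf       <- cumsum(pmf)
central90 <- c(lower = y_vals[which(cdf >= 0.05)[1]],
               upper = y_vals[which(cdf >= 0.95)[1]])

cat("Prior predictive mean:", round(pred_mean, 2), "\n")
cat("Central 90% range of successes:", central90, "\n")
```

If the central range covers outcomes you would consider absurd for the planned study, that is a signal to tighten or shift the prior.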

Prior Sensitivity Analysis

# Same data: y = 14, n = 20. Three different priors.
priors <- list(
  "Beta(12,8)"  = c(12, 8),
  "Beta(2,2)"   = c(2,  2),
  "Beta(1,1)"   = c(1,  1)
)

for (nm in names(priors)) {
  a <- priors[[nm]][1]
  b <- priors[[nm]][2]
  post_mean <- (a + 14) / (a + b + 20)
  cat(nm, "→ posterior mean:", round(post_mean, 3), "\n")
}

Beta(12,8) → posterior mean: 0.65 
Beta(2,2) → posterior mean: 0.667 
Beta(1,1) → posterior mean: 0.682

No prior is “objective.” Beta(1,1) has ESS = 2 — it asserts that any success probability is equally plausible, from 0.001 to 0.999. That is itself an assumption, not a neutral stance.
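
Posterior means alone can understate how much a prior matters, so it is often worth tabulating intervals and a tail probability as well. The sketch below extends the comparison above using base R; the column names and the 0.70 threshold (borrowed from the Beta–Binomial example) are illustrative choices.

```r
# Same data (y = 14, n = 20) under the three priors compared above
y <- 14; n <- 20
priors <- list("Beta(12,8)" = c(12, 8),
               "Beta(2,2)"  = c(2, 2),
               "Beta(1,1)"  = c(1, 1))

sens <- do.call(rbind, lapply(names(priors), function(nm) {
  a <- priors[[nm]][1] + y        # posterior shape1
  b <- priors[[nm]][2] + n - y    # posterior shape2
  data.frame(prior     = nm,
             post_mean = a / (a + b),
             lower_95  = qbeta(0.025, a, b),
             upper_95  = qbeta(0.975, a, b),
             p_gt_70   = pbeta(0.70, a, b, lower.tail = FALSE))
}))

print(sens, digits = 3, row.names = FALSE)
```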

In-Class Exercises

Exercise 1: Identify the Family

For each scenario below, identify the appropriate model (likelihood + prior family) and write the posterior update formula.

Scenario	Data type	Model?
% of phobia patients in remission after therapy	Binary	?
# of nightmares per week	Counts	?
Average BDI depression score (assume known SD)	Continuous	?
# correct in a memory task (out of 30 words)	Binary	?

Write:

The likelihood \(Y \mid \theta \sim \ldots\)
A conjugate prior \(\theta \sim \ldots\)
The posterior update rule

Scenario	Likelihood	Prior	Posterior
Remission rate	Binomial\((n, \theta)\)	Beta\((\alpha, \beta)\)	Beta\((\alpha+y, \beta+n-y)\)
Nightmares/week	Poisson\((\lambda)\)	Gamma\((s, r)\)	Gamma\((s+\sum y_i, r+n)\)
Mean BDI	Normal\((\mu, \sigma^2)\)	Normal\((\mu_0, \tau^2)\)	Normal\((\mu_\text{post}, \sigma^2_\text{post})\)
Memory task	Binomial\((30, \theta)\)	Beta\((\alpha, \beta)\)	Beta\((\alpha+y, \beta+30-y)\)

Exercise 2: Placebo Effect Study

18 out of 25 participants report lower pain after receiving a sugar pill.

Choose a Beta\((\alpha, \beta)\) prior. Justify your choice (use ESS reasoning).
Run a prior predictive check.
Derive and summarise the posterior using bayesrules.
Compute \(P(\theta > 0.6 \mid y)\).
Repeat under Beta(1,1) — how much do conclusions change?

library(bayesrules)

# 1. Prior: literature suggests placebo effects around 50–60%
#    Use Beta(6, 4): mean = 0.60, ESS = 10 (modest prior knowledge)
alpha_prior <- 6; beta_prior <- 4

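# 2. Prior predictive check
set.seed(1)  # arbitrary seed so the simulation is reproducible
theta_sim  <- rbeta(1e4, alpha_prior, beta_prior)
y_sim      <- rbinom(1e4, size = 25, prob = theta_sim)
# Half-integer breaks give each count 0, 1, ..., 25 its own bar
hist(y_sim, breaks = seq(-0.5, 25.5, by = 1), main = "Prior predictive check",
     xlab = "Simulated successes (n=25)", col = "skyblue", border = "white")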

# 3. Posterior: Beta(6+18, 4+7) = Beta(24, 11)
plot_beta_binomial(alpha = alpha_prior, beta = beta_prior, y = 18, n = 25)

summarize_beta_binomial(alpha = alpha_prior, beta = beta_prior, y = 18, n = 25)

# 4. P(theta > 0.60 | y)
pbeta(0.60, 24, 11, lower.tail = FALSE)

[1] 0.8616852

# 5. Sensitivity: Beta(1,1)
summarize_beta_binomial(alpha = 1, beta = 1, y = 18, n = 25)

# Posterior under Beta(1,1): Beta(1 + 18, 1 + 7) = Beta(19, 8)
pbeta(0.60, 19, 8, lower.tail = FALSE)

[1] 0.8784429

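Interpretation: Under Beta(6,4), the posterior mean is 24/35 ≈ 0.69 and \(P(\theta > 0.60 \mid y) \approx 0.86\). Under the flat Beta(1,1), the posterior mean is 19/27 ≈ 0.70 and \(P(\theta > 0.60 \mid y) \approx 0.88\). The posterior shifts only slightly because the data (n = 25) outweigh either prior (ESS = 10 or 2), so the conclusion that \(\theta\) probably exceeds 0.60 is robust.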

Exercise 3: Choosing a Prior via Moment Matching

A cognitive psychologist expects Stroop interference to add about 80 ms to the baseline mean reaction time of 650 ms, with considerable uncertainty (SD ≈ 30 ms). Measurement error is known to be \(\sigma = 50\) ms.

Use moment matching to find the Normal prior \(\mu \sim N(\mu_0, \tau^2)\).
The researcher collects RTs: {720, 695, 740, 710, 680} ms. Derive the posterior.
Compute the posterior mean and compare it to the prior mean and sample mean.
How would the posterior change if the prior SD were 100 ms instead of 30 ms? What does this tell you about prior informativeness?

library(bayesrules)

# 1. Prior from moment matching: baseline 650 ms + 80 ms interference
#    gives mu_0 = 730; the stated prior SD gives tau = 30
mu0  <- 730
tau  <- 30
sigma <- 50

# 2. Data
y <- c(720, 695, 740, 710, 680)
n <- length(y)
ybar <- mean(y)
cat("Sample mean:", ybar, "\n")

Sample mean: 709

# Posterior via Normal-Normal update
sigma2 <- sigma^2; tau2 <- tau^2
sigma2_post <- 1 / (n / sigma2 + 1 / tau2)
mu_post     <- sigma2_post * (n * ybar / sigma2 + mu0 / tau2)
cat("Posterior mean:", round(mu_post, 1), "\n")

Posterior mean: 716.5

cat("Posterior SD:",   round(sqrt(sigma2_post), 1), "\n")

Posterior SD: 17.9

# 3. Comparison
cat("\nPrior mean:", mu0, "| Sample mean:", ybar, "| Posterior mean:", round(mu_post, 1), "\n")


Prior mean: 730 | Sample mean: 709 | Posterior mean: 716.5

# 4. Vague prior (tau = 100)
tau2_vague   <- 100^2
sigma2_post2 <- 1 / (n / sigma2 + 1 / tau2_vague)
mu_post2     <- sigma2_post2 * (n * ybar / sigma2 + mu0 / tau2_vague)
cat("\nVague prior: posterior mean:", round(mu_post2, 1),
    " | posterior SD:", round(sqrt(sigma2_post2), 1), "\n")


Vague prior: posterior mean: 710  | posterior SD: 21.8

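Interpretation: With the tight prior (SD = 30, ESS \(\approx 2500/900 \approx 2.8\) observations), the posterior mean of 716.5 ms is pulled about a third of the way from the sample mean (709 ms) toward the prior mean (730 ms). With the vague prior (SD = 100, ESS \(\approx 0.25\)), the posterior mean of 710 ms sits almost on the sample mean, and the posterior SD rises from 17.9 to 21.8 ms. The prior’s ESS (\(\approx \sigma^2/\tau^2\)) relative to \(n = 5\) drives this difference.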
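
The precision form used in the solution code is algebraically the same as the weight form from the lecture, and the weights make the role of the ESS explicit. A minimal sketch in base R, with a hypothetical helper `post_by_weights`:

```r
# Exercise 3 inputs
mu0 <- 730; sigma <- 50
y    <- c(720, 695, 740, 710, 680)
n    <- length(y)
ybar <- mean(y)

# Posterior mean via w_prior = sigma^2 / (n * tau^2 + sigma^2)
post_by_weights <- function(tau) {
  w_prior <- sigma^2 / (n * tau^2 + sigma^2)
  c(ess       = sigma^2 / tau^2,
    w_prior   = w_prior,
    post_mean = w_prior * mu0 + (1 - w_prior) * ybar)
}

res <- rbind("tau = 30"  = post_by_weights(30),
             "tau = 100" = post_by_weights(100))
print(round(res, 3))
```

The prior weight equals ESS / (ESS + n), so moving from \(\tau = 30\) to \(\tau = 100\) cuts the prior's share of the posterior mean from roughly a third to under 5%.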

Next lecture (June 10): When conjugacy doesn’t deliver a closed-form posterior, we approximate it instead — grid approximation for low-dimensional problems and MCMC for everything else.