Lecture 4 — Obtaining the Posterior: Grid Approximation & MCMC (Ch. 6 & 7)

Author

Johnny van Doorn

Published

June 10, 2026

By the end of this lecture, you will be able to:

  • Explain why conjugate models are limited and when you must go beyond them
  • Implement grid approximation to obtain posterior samples for non-conjugate models
  • Explain the intuition behind Markov Chain Monte Carlo (MCMC) and the Metropolis algorithm
  • Read the key MCMC convergence diagnostics (trace plots, \(\hat R\), effective sample size)
  • Use posterior samples — regardless of how they were obtained — to compute summaries

Recommended Reading: Bayes Rules! Chapters 6 (approximating the posterior) and 7 (MCMC under the hood). Suggested exercises: Ch. 6 — 2, 3, 8; Ch. 7 — 4, 9.

On the exam you don’t need to implement samplers by hand or use Stan, but you should understand the intuition and be able to interpret diagnostics.


Review: Four Ingredients of a Bayesian Analysis

Every Bayesian model is built on the same four components, regardless of model complexity.

The four ingredients

  1. Prior model — beliefs about parameter \(\theta\) before seeing data: \(f(\theta)\)
  2. Data model (likelihood) — how data arise given the parameter: \(f(y \mid \theta)\)
  3. Likelihood function — the same expression viewed as a function of \(\theta\) with \(y\) fixed: \(L(\theta \mid y) \propto f(y \mid \theta)\)
  4. Posterior model — updated beliefs after seeing data: \(f(\theta \mid y) \propto f(\theta) \cdot L(\theta \mid y)\)

The three conjugate families from last lecture:

Likelihood Prior Posterior
Binomial\((n, \theta)\) Beta\((\alpha, \beta)\) Beta\((\alpha+y,\;\beta+n-y)\)
Poisson\((\lambda)\) Gamma\((s, r)\) Gamma\((s+\sum y_i,\; r+n)\)
Normal\((\mu, \sigma^2)\), \(\sigma\) known Normal\((\mu_0, \tau^2)\) Normal\((\mu_\text{post}, \sigma^2_\text{post})\)

Today we ask: what happens when the model is not conjugate, or when the posterior simply has no closed form? We turn to numerical approximations — grid approximation and MCMC.


Running Example: Dutch Railways

We will use one running example throughout this lecture, starting with a conjugate set-up and then breaking it.

Scenario

You are a data analyst for the Dutch Railway Service (NS). Your task: model the monthly rate of signal failures on a 10 km stretch of track.

  • Parameter: \(\lambda =\) mean failures per month
  • Likelihood: \(Y_i \mid \lambda \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)\)
  • Prior: Gamma(2, 0.5) — failures are relatively rare, modest prior certainty (mean = 4)
Code
library(bayesrules)

# Observed 4 months: y = (2, 0, 3, 1)
y <- c(2, 0, 3, 1)

# Posterior: Gamma(s + sum(y), r + n) = Gamma(8, 4.5)
s_post <- 2 + sum(y)
r_post <- 0.5 + length(y)
cat("Posterior: Gamma(", s_post, ",", r_post, ")\n")
Posterior: Gamma( 8 , 4.5 )
Code
plot_gamma_poisson(shape = 2, rate = 0.5, sum_y = sum(y), n = length(y))

Code
summarize_gamma_poisson(shape = 2, rate = 0.5, sum_y = sum(y), n = length(y))

So far so familiar — the Gamma–Poisson is conjugate, and the posterior is a clean Gamma(8, 4.5). But what if the engineer’s prior beliefs aren’t Gamma-shaped? Maybe she’s confident that \(\lambda\) lies in \([1, 5]\) and is unwilling to commit to a particular shape within that range. Or maybe she has two competing hypotheses about the maintenance regime that suggest a bimodal prior. The conjugacy trick breaks — but the mathematics

\[f(\lambda \mid y) \propto f(\lambda) \cdot L(y \mid \lambda)\]

still holds. We just can’t read off the answer in closed form. That’s the puzzle the rest of this lecture solves.


Beyond Conjugacy: Grid Approximation & MCMC

Conjugate models require prior and likelihood to match. Real research often involves:

  • Multiple unknown parameters
  • Non-standard prior shapes (truncated, mixture)
  • Complex hierarchical models

The good news: \(f(\theta \mid y) \propto f(\theta) \cdot L(y \mid \theta)\) always holds. We just need numerical methods to compute usable summaries.

Grid Approximation

  1. Define a grid: \(\theta_1, \ldots, \theta_K\)
  2. Evaluate unnormalised posterior at each point: \(u_i = f(\theta_i) \cdot L(y \mid \theta_i)\)
  3. Normalise: \(\hat{f}(\theta_i \mid y) = u_i / \sum_j u_j\)
  4. Sample from the grid with probabilities proportional to \(\hat{f}\)
Code
# Non-conjugate example: Beta(3,3) prior truncated to [0.4, 1]
theta_grid <- seq(0, 1, length.out = 2000)

prior      <- dbeta(theta_grid, 3, 3) * (theta_grid >= 0.4)
likelihood <- dbinom(18, size = 25, prob = theta_grid)

unnorm         <- prior * likelihood
posterior_grid <- unnorm / sum(unnorm)

theta_samples <- sample(theta_grid, size = 1e4,
                        replace = TRUE, prob = posterior_grid)

hist(theta_samples, breaks = 50, probability = TRUE,
     main = "Grid approx: truncated Beta(3,3) prior, y=18, n=25",
     xlab = expression(theta), col = "steelblue", border = "white")

Code
cat("Posterior mean:", round(mean(theta_samples), 3), "\n")
Posterior mean: 0.679 
Code
cat("95% ETI: [", round(quantile(theta_samples, 0.025), 3),
    ",", round(quantile(theta_samples, 0.975), 3), "]\n")
95% ETI: [ 0.509 , 0.825 ]

Limitation: Grid points grow as \(K^p\) with the number of parameters \(p\). With 10 parameters and 100 grid points each, you need \(100^{10} = 10^{20}\) evaluations — infeasible.

MCMC — The Intuition

Markov Chain Monte Carlo draws samples from the posterior without computing the normalising constant.

The island-hopping metaphor (Metropolis algorithm)

A politician visits islands whose populations represent posterior density:

  1. Propose a neighbouring island at random
  2. If it’s bigger → always move
  3. If it’s smaller → move with probability = (proposed / current population)
  4. Otherwise → stay put

Long-run time spent on each island \(\propto\) population — without ever needing the total.

The politician and the unknown total

The metaphor maps exactly onto the posterior:

Island world Posterior
an island a parameter value \(\lambda\)
island population plausibility \(f(\lambda)\,L(\lambda \mid y)\) (prior × likelihood)
total population of all islands normalising constant \(f(y)\)

The politician can only count the people on his current island and on the neighbour he proposes. He never takes a world census — that would mean visiting every island and summing them, i.e. computing \(f(y) = \int f(\lambda)\,L(\lambda \mid y)\,d\lambda\), the intractable integral.

And he doesn’t need it. Standing on an island of 1,000, proposing one of 600:

\[\alpha = \frac{600}{1000} = 0.6 .\]

The “true” population shares are \(600/T\) and \(1000/T\) (with \(T\) the world total), but their ratio is \(\tfrac{600/T}{1000/T} = 0.6\) — the unknown total \(T\) cancels. Two local head-counts are all he needs.

Why he still visits the small islands

His rule: a bigger neighbour → always go; a smaller neighbour → go with probability equal to the size ratio.

Why not just “always move to the bigger island”? Then he would climb to the single largest island and never leave — reporting only the posterior’s peak (the mode) and missing its entire spread.

Accepting smaller islands in proportion to their size is what fixes this:

  • big islands → visited often
  • a neighbour 60% as big → visited about 60% as often
  • tiny islands → rarely, but not never

The payoff: over a long tour, the time spent on each island is proportional to its population — i.e. to the posterior plausibility. So the histogram of where he has been reconstructs the whole posterior, in correct proportions, without his ever knowing the total.

The acceptance rule

The metaphor becomes exact through the acceptance probability. Standing at the current value \(\mu\), we propose a new value \(\mu'\) from a (symmetric) proposal model and then accept the move with probability

\[\alpha = \min\!\left\{1,\; \frac{f(\mu')\,L(\mu' \mid y)}{f(\mu)\,L(\mu \mid y)}\right\}.\]

Here is the crucial trick. Divide the numerator and denominator by the intractable normalising constant \(f(y)\) — it cancels:

\[\alpha = \min\!\left\{1,\; \frac{f(\mu' \mid y)}{f(\mu \mid y)}\right\}.\]

So the accept/reject decision depends only on the ratio of (unnormalised) posterior plausibilities, which is exactly what we can evaluate as \(f(\mu)\,L(\mu \mid y)\). This is the whole reason MCMC escapes the normalising-constant problem that broke conjugacy.

Two scenarios — the “bigger / smaller island” rules made precise

  • Proposal at least as plausible (\(f(\mu' \mid y) \ge f(\mu \mid y)\)): \(\alpha = 1\), so we always move.
  • Proposal less plausible (\(f(\mu' \mid y) < f(\mu \mid y)\)): move with probability \(\alpha = f(\mu' \mid y) / f(\mu \mid y) < 1\). Bolder downhill jumps are accepted less often, so the chain still visits the tails occasionally but spends most of its time where the posterior mass is.

With a symmetric proposal — like the Uniform \(\mu' \mid \mu \sim \text{Unif}(\mu - w, \mu + w)\) used in Exercises 7.4 and 7.9 — the proposal terms cancel and we get exactly the Metropolis rule above. Asymmetric proposals carry an extra factor \(q(\mu \mid \mu')/q(\mu' \mid \mu)\) (the full Metropolis–Hastings rule).


The Algorithm, Step by Step (Dutch Railways)

Let’s run the algorithm on our signal-failure data \(y = (2, 0, 3, 1)\) with the Gamma(2, 0.5) prior. We already know the posterior is Gamma(8, 4.5), so we can hold the sampler up against the truth. The only ingredient we need is the unnormalised log-posterior, \(\log\!\big[f(\lambda)\,L(\lambda \mid y)\big]\):

Code
y <- c(2, 0, 3, 1)

# Unnormalised log-posterior for the NS rate lambda
# (prior Gamma(2, 0.5)  x  Poisson likelihood)
log_post_ns <- function(lambda) {
  if (lambda <= 0) return(-Inf)                      # lambda must be positive
  dgamma(lambda, shape = 2, rate = 0.5, log = TRUE) +
    sum(dpois(y, lambda, log = TRUE))
}

One iteration

Suppose the chain currently sits at \(\lambda = 2\). One Metropolis step is exactly the three moves from the acceptance rule: propose, score, decide.

Code
set.seed(13)
current <- 2        # current chain location
width   <- 0.8      # proposal SD (symmetric Normal proposal -> Metropolis)

# STEP 1 — propose a nearby value
proposal <- rnorm(1, mean = current, sd = width)

# STEP 2 — acceptance probability = min(1, ratio of unnormalised posteriors)
alpha <- min(1, exp(log_post_ns(proposal) - log_post_ns(current)))

# STEP 3 — flip a coin: move with probability alpha, otherwise stay put
u        <- runif(1)
next_val <- if (u < alpha) proposal else current

cat("current    :", round(current, 3), "\n")
current    : 2 
Code
cat("proposal   :", round(proposal, 3), "\n")
proposal   : 2.443 
Code
cat("alpha      :", round(alpha, 3), "\n")
alpha      : 0.552 
Code
cat("coin u     :", round(u, 3), if (u < alpha) " -> ACCEPT" else " -> reject", "\n")
coin u     : 0.39  -> ACCEPT 
Code
cat("next value :", round(next_val, 3), "\n")
next value : 2.443 

In this draw the proposal is a little less plausible than the current value, so \(\alpha < 1\) — we only might move (Scenario 2 from the acceptance rule). The coin flip happens to fall below \(\alpha\), so we accept. Had it landed above \(\alpha\), the chain would have stayed at \(\lambda = 2\) and recorded that value again.

…and keep going

A full sampler just repeats that step a few thousand times, each iteration starting from wherever the previous one landed:

Code
run_mh <- function(width, start, n_iter) {
  lam <- numeric(n_iter)
  lam[1] <- start
  accepts <- 0
  for (t in 2:n_iter) {
    proposal <- rnorm(1, mean = lam[t - 1], sd = width)
    if (log(runif(1)) < log_post_ns(proposal) - log_post_ns(lam[t - 1])) {
      lam[t] <- proposal; accepts <- accepts + 1
    } else {
      lam[t] <- lam[t - 1]                            # rejected: stay put
    }
  }
  attr(lam, "accept_rate") <- accepts / (n_iter - 1)
  lam
}

set.seed(13)
chain <- run_mh(width = 0.8, start = 2, n_iter = 2000)
round(head(chain, 10), 3)        # the chain's first ten stops
 [1] 2.000 2.443 1.378 1.527 1.527 1.527 1.250 2.134 2.229 1.140

See how some values repeat? Those are the iterations where the proposal was rejected and the chain stayed where it was.

A reusable trace-plot tool

To explore how the sampler behaves under different settings, here is a small helper. Vary the proposal width, the starting value(s), the number of chains, and the chain length, and it returns the trace plot (the dashed line marks the true posterior mean, \(8/4.5 \approx 1.78\)):

Code
trace_demo <- function(width = 0.8, starts = 2, n_chains = 1,
                       n_iter = 2000, seed = 1, title = NULL) {
  set.seed(seed)
  starts <- rep_len(starts, n_chains)                 # one start per chain
  chains <- lapply(starts, function(s) run_mh(width = width, start = s, n_iter = n_iter))
  acc    <- mean(sapply(chains, attr, "accept_rate"))

  df <- data.frame(
    iteration = rep(seq_len(n_iter), times = n_chains),
    lambda    = unlist(chains),
    chain     = factor(rep(seq_len(n_chains), each = n_iter))
  )
  if (is.null(title))
    title <- sprintf("width = %g | start(s) = %s | %d chain%s | accept = %.0f%%",
                     width, paste(round(starts, 2), collapse = ", "),
                     n_chains, if (n_chains > 1) "s" else "", 100 * acc)

  ggplot2::ggplot(df, ggplot2::aes(iteration, lambda, colour = chain)) +
    ggplot2::geom_line(linewidth = 0.3, alpha = 0.85) +
    ggplot2::geom_hline(yintercept = 8 / 4.5, linetype = "dashed") +
    ggplot2::labs(title = title, x = "iteration", y = expression(lambda)) +
    ggplot2::theme_minimal(base_size = 12) +
    ggplot2::theme(legend.position = if (n_chains > 1) "right" else "none",
          plot.title = ggplot2::element_text(size = 10))
}

# A healthy run
trace_demo(width = 0.8, starts = 2)

A well-tuned chain like this is the “hairy caterpillar”: it wanders rapidly around the dashed posterior mean with no trends and no long flat stretches.


Diagnosing the Chain

In practice you let brms (which calls Stan) run the sampler and report the diagnostics automatically — but the ideas are identical whether the draws come from Stan or from the by-hand Metropolis sampler we build below. A Markov chain only approximates the posterior, so we must check that the approximation is trustworthy.

Diagnostic What it checks Healthy sign
Trace plot does a single chain explore stably? “hairy caterpillar”: white noise, no trends or flat runs
Burn-in / warm-up early samples while the chain finds the posterior discard them before summarising
Parallel chains do independent chains agree? several chains from different starts overlap
Effective sample size \(N_{\text{eff}}\) how many independent draws the dependent draws are worth ratio \(N_{\text{eff}}/N\) not too small (suspect if \(< 0.1\))
Autocorrelation how fast dependence between draws fades drops to ~0 within a few lags
\(\hat R\) (R-hat) between-chain vs. within-chain variability \(\approx 1\); a value \(> 1.05\) is a red flag

Visual diagnostics: trace plots

Now we can see the pathologies from Chapter 6.3 just by handing different settings to trace_demo(). A good trace plot looks like featureless white noise around the posterior mean; the troublesome ones drift slowly or stick.

Code
library(gridExtra)
grid.arrange(
  trace_demo(width = 0.8,  starts = 2,            n_chains = 1,
             title = "(a) Healthy: width = 0.8"),
  trace_demo(width = 0.02, starts = 2,            n_chains = 1,
             title = "(b) Width too small (0.02): slow mixing"),
  trace_demo(width = 20,   starts = 2,            n_chains = 1,
             title = "(c) Width too large (20): gets stuck"),
  trace_demo(width = 0.8,  starts = c(0.3, 4, 9), n_chains = 3,
             title = "(d) Three chains from different starts"),
  ncol = 2
)

  • (a) mixes well — rapid exploration around the posterior mean (~61% of proposals accepted).
  • (b) a too-small width takes tiny steps: nearly every proposal is accepted (~99%), but the chain only inches along and is strongly autocorrelated — it has barely explored after 2,000 iterations.
  • (c) a too-large width proposes wild values (often negative, always rejected), so only ~4% are accepted and the chain sticks for long stretches — the flat horizontal runs.
  • (d) three chains launched from \(\lambda \in \{0.3, 4, 9\}\) all migrate to the same band within a few dozen iterations (the burn-in) and then agree — strong evidence of convergence.

Try it live: call trace_demo(width = ..., starts = ..., n_chains = ..., n_iter = ...) with your own settings and watch the mixing — and the acceptance rate in each title — respond.

Numerical diagnostics: effective sample size & R-hat

The numbers back up the picture. We run four longer parallel chains with run_mh() and hand them to coda:

Code
set.seed(2026)
raw <- lapply(c(0.5, 2, 5, 8), function(s) run_mh(width = 0.8, start = s, n_iter = 6000))
mc  <- mcmc.list(lapply(raw, function(ch) mcmc(ch[1001:6000])))   # drop 1,000 burn-in

N    <- 4 * 5000                       # total retained draws
ess  <- sum(effectiveSize(mc))         # effective sample size
rhat <- gelman.diag(mc)$psrf[1, 1]     # split-chain R-hat

cat("Total retained draws   :", N, "\n")
Total retained draws   : 20000 
Code
cat("Effective sample size  :", round(ess), "\n")
Effective sample size  : 2830 
Code
cat("Eff. sample-size ratio :", round(ess / N, 3), "  (suspect if < 0.1)\n")
Eff. sample-size ratio : 0.142   (suspect if < 0.1)
Code
cat("R-hat                  :", round(rhat, 4), "  (red flag if > 1.05)\n")
R-hat                  : 1.0007   (red flag if > 1.05)
Code
cat("Lag-1 autocorrelation  :", round(acf(raw[[1]][1001:6000], plot = FALSE)$acf[2], 3), "\n")
Lag-1 autocorrelation  : 0.728 

Here the ratio sits above 0.1 and \(\hat R \approx 1\), so we trust the approximation. The moderate lag-1 autocorrelation is expected for a random-walk Metropolis sampler — exactly what a better-tuned proposal (Exercise 7.4) or a smarter sampler such as Stan’s would reduce, raising the effective sample size.

The autocorrelation plot

Bayes Rules! inspects chains with the bayesplot package — mcmc_trace() for trace plots, mcmc_dens_overlay() for the parallel-chain densities, and mcmc_acf() for the autocorrelation plot (the book’s Figures 6.15–6.16). The autocorrelation plot shows the correlation between chain values that are 1, 2, 3, … steps apart: in a healthy chain it drops to ~0 within a few lags; in a slow-mixing chain it lingers near 1. The same functions work on our hand-rolled draws, not just Stan output:

Code
library(bayesplot)

# One chain at each setting, packaged as a single-column "draws" matrix
acf_mat <- function(width, seed = 2026) {
  set.seed(seed)
  draws <- run_mh(width = width, start = 2, n_iter = 6000)[1001:6000]
  matrix(draws, ncol = 1, dimnames = list(NULL, "lambda"))
}

grid.arrange(
  mcmc_acf(acf_mat(0.8))  + ggtitle("Well-mixed (width = 0.8): decays fast"),
  mcmc_acf(acf_mat(0.05)) + ggtitle("Slow-mixing (width = 0.05): lingers near 1"),
  ncol = 2
)

On the left the autocorrelation falls away within a handful of lags — the draws quickly become “effectively independent,” giving a high effective sample size. On the right the tiny proposal width keeps consecutive draws almost identical, so the autocorrelation stays near 1 for many lags: lots of iterations, very little independent information. This is the visual signature behind a low \(N_{\text{eff}}/N\) ratio.

Full bayesplot parity: trace & density plots

mcmc_acf() above already accepted our hand-rolled chains; the book’s other two workhorses — mcmc_trace() and mcmc_dens_overlay() — take the same input. This one-liner packs any set of run_mh() chains into the iterations × chains × parameter array that bayesplot expects:

Code
# Convert run_mh() chains into the array bayesplot wants
as_draws <- function(width, starts, n_iter = 6000, burnin = 1000, seed = 2026) {
  set.seed(seed)
  chains <- lapply(starts, function(s) run_mh(width, s, n_iter)[(burnin + 1):n_iter])
  array(unlist(chains), dim = c(n_iter - burnin, length(starts), 1),
        dimnames = list(NULL, paste0("chain", seq_along(starts)), "lambda"))
}

draws <- as_draws(width = 0.8, starts = c(0.5, 2, 5, 8))

mcmc_trace(draws, pars = "lambda")          # overlaid trace plots (book Fig 6.8)

Code
mcmc_dens_overlay(draws, pars = "lambda")   # per-chain density overlay (book Fig 6.13)

These are exactly the diagnostics from Bayes Rules!: overlaid traces that should look like uniform noise, and per-chain densities that should sit on top of one another. Swap in any width/starts to contrast healthy vs. problematic runs live — e.g. as_draws(width = 0.05, starts = c(0.5, 2, 5, 8)) gives chains whose densities visibly disagree.


Using the Posterior Samples

Once you have MCMC draws, all posterior summaries work exactly the same as with conjugate samples:

Code
# theta_mcmc: any posterior sample (grid, MCMC, conjugate — identical workflow)
theta_mcmc <- theta_samples

mean(theta_mcmc)                        # posterior mean
[1] 0.6786158
Code
median(theta_mcmc)                      # posterior median
[1] 0.6833417
Code
mean(theta_mcmc > 0.6)                  # P(theta > 0.6 | y)
[1] 0.8293
Code
quantile(theta_mcmc, c(0.025, 0.975))   # 95% ETI
     2.5%     97.5% 
0.5092546 0.8254127 

In-Class Exercises

Exercise 1: Grid Approximation with a Truncated Prior

A researcher believes the placebo effect rate is between 30% and 90%, and is otherwise uncertain. Prior: Beta(2,2) truncated to \([0.30, 0.90]\). Observed: 18 of 25 patients improve.

  1. Implement the grid approximation.
  2. Compute the posterior mean and 95% ETI.
  3. Compare to an untruncated Beta(2,2) — how much does the truncation matter here?
Code
theta_grid <- seq(0, 1, length.out = 5000)
y_obs <- 18; n_obs <- 25

# Truncated prior
prior_trunc <- dbeta(theta_grid, 2, 2) *
               (theta_grid >= 0.3 & theta_grid <= 0.9)
lik         <- dbinom(y_obs, n_obs, prob = theta_grid)
post_trunc  <- (prior_trunc * lik) / sum(prior_trunc * lik)
samp_trunc  <- sample(theta_grid, 1e4, replace = TRUE, prob = post_trunc)

# Untruncated
prior_full <- dbeta(theta_grid, 2, 2)
post_full  <- (prior_full * lik) / sum(prior_full * lik)
samp_full  <- sample(theta_grid, 1e4, replace = TRUE, prob = post_full)

cat("Truncated: mean =", round(mean(samp_trunc), 3),
    " 95% ETI = [", round(quantile(samp_trunc, 0.025), 3),
    ",",             round(quantile(samp_trunc, 0.975), 3), "]\n")
Truncated: mean = 0.69  95% ETI = [ 0.516 , 0.839 ]
Code
cat("Full:      mean =", round(mean(samp_full),  3),
    " 95% ETI = [", round(quantile(samp_full,  0.025), 3),
    ",",             round(quantile(samp_full,  0.975), 3), "]\n")
Full:      mean = 0.689  95% ETI = [ 0.514 , 0.839 ]
Code
df_plot <- data.frame(
  theta = c(samp_trunc, samp_full),
  Prior = rep(c("Truncated Beta(2,2)", "Full Beta(2,2)"), each = 1e4)
)
ggplot(df_plot, aes(x = theta, fill = Prior)) +
  geom_density(alpha = 0.5) +
  labs(title = "Posterior: truncated vs full prior",
       x = expression(theta), y = "Density") +
  theme_minimal(base_size = 13) +
  theme(legend.title = element_blank())

Exercise 2: Metropolis by hand (Dutch Railways)

Back to NS signal failures with prior Gamma(2, 0.5) and data \(y = (2, 0, 3, 1)\). Pretend you’ve forgotten that the conjugate posterior is Gamma(8, 4.5). Build a tiny Metropolis sampler instead and compare.

  1. Implement a Metropolis sampler for \(\lambda > 0\) with a Normal proposal centred at the current value (e.g. SD = 0.5). Use the unnormalised posterior \(f(\lambda) \cdot L(y \mid \lambda)\) in the acceptance ratio.
  2. Run 5,000 iterations, discard the first 1,000 as burn-in.
  3. Compare your sampled mean and 95% ETI to the analytic posterior Gamma(8, 4.5).
  4. Produce a trace plot. Does the chain look like a “hairy caterpillar”?
Code
set.seed(2026)

y <- c(2, 0, 3, 1)

log_post <- function(lambda) {
  if (lambda <= 0) return(-Inf)
  dgamma(lambda, shape = 2, rate = 0.5, log = TRUE) +
    sum(dpois(y, lambda, log = TRUE))
}

n_iter   <- 5000
lambda   <- numeric(n_iter)
lambda[1] <- 1            # starting value
accepts  <- 0

for (t in 2:n_iter) {
  proposal <- rnorm(1, mean = lambda[t - 1], sd = 0.5)
  log_a <- log_post(proposal) - log_post(lambda[t - 1])
  if (log(runif(1)) < log_a) {
    lambda[t] <- proposal
    accepts   <- accepts + 1
  } else {
    lambda[t] <- lambda[t - 1]
  }
}

cat("Acceptance rate:", round(accepts / (n_iter - 1), 3), "\n")
Acceptance rate: 0.742 
Code
post <- lambda[1001:n_iter]                       # burn-in
cat("MCMC mean :", round(mean(post), 3), "\n")
MCMC mean : 1.77 
Code
cat("Analytic  :", round(8 / 4.5, 3), "\n")
Analytic  : 1.778 
Code
cat("MCMC 95% ETI:", round(quantile(post, c(.025, .975)), 3), "\n")
MCMC 95% ETI: 0.756 3.145 
Code
cat("Analytic   :", round(qgamma(c(.025, .975), 8, 4.5), 3), "\n")
Analytic   : 0.768 3.205 
Code
par(mfrow = c(1, 2))
plot(lambda, type = "l", col = "steelblue",
     xlab = "iteration", ylab = expression(lambda),
     main = "Trace plot")
abline(v = 1000, col = "red", lty = 2)
hist(post, breaks = 40, probability = TRUE, col = "steelblue",
     border = "white", main = "Posterior (after burn-in)",
     xlab = expression(lambda))
curve(dgamma(x, 8, 4.5), add = TRUE, col = "darkred", lwd = 2)

Code
par(mfrow = c(1, 1))

The MCMC mean and ETI should sit very close to the analytic Gamma(8, 4.5) values — same posterior, two different routes.


Next lecture (June 15): Lecture 5 puts the posterior to work: point estimates, credible intervals, posterior prediction, prior + posterior predictive checks, and Bayes factors for hypothesis testing (Ch. 8 + Etz et al. 2018).