Lecture 4 — Obtaining the Posterior: Grid Approximation & MCMC (Ch. 6 & 7)

Author

Johnny van Doorn

Published

June 10, 2026

By the end of this lecture, you will be able to:

Explain why conjugate models are limited and when you must go beyond them
Implement grid approximation to obtain posterior samples for non-conjugate models
Explain the intuition behind Markov Chain Monte Carlo (MCMC) and the Metropolis algorithm
Read the key MCMC convergence diagnostics (trace plots, \(\hat R\), effective sample size)
Use posterior samples — regardless of how they were obtained — to compute summaries

Recommended Reading: Bayes Rules! Chapters 6 (approximating the posterior) and 7 (MCMC under the hood). Suggested exercises: Ch. 6 — 2, 3, 8; Ch. 7 — 4, 9.

On the exam you don’t need to implement samplers by hand or use Stan, but you should understand the intuition and be able to interpret diagnostics.

Review: Four Ingredients of a Bayesian Analysis

Every Bayesian model is built on the same four components, regardless of model complexity.

The four ingredients

Prior model — beliefs about parameter \(\theta\) before seeing data: \(f(\theta)\)
Data model (likelihood) — how data arise given the parameter: \(f(y \mid \theta)\)
Likelihood function — the same expression viewed as a function of \(\theta\) with \(y\) fixed: \(L(\theta \mid y) \propto f(y \mid \theta)\)
Posterior model — updated beliefs after seeing data: \(f(\theta \mid y) \propto f(\theta) \cdot L(\theta \mid y)\)

The three conjugate families from last lecture:

Likelihood	Prior	Posterior
Binomial\((n, \theta)\)	Beta\((\alpha, \beta)\)	Beta\((\alpha+y,\;\beta+n-y)\)
Poisson\((\lambda)\)	Gamma\((s, r)\)	Gamma\((s+\sum y_i,\; r+n)\)
Normal\((\mu, \sigma^2)\), \(\sigma\) known	Normal\((\mu_0, \tau^2)\)	Normal\((\mu_\text{post}, \sigma^2_\text{post})\)

Today we ask: what happens when the model is not conjugate, or when the posterior simply has no closed form? We turn to numerical approximations — grid approximation and MCMC.

Running Example: Dutch Railways

We will use one running example throughout this lecture, starting with a conjugate set-up and then breaking it.

Scenario

You are a data analyst for the Dutch Railway Service (NS). Your task: model the monthly rate of signal failures on a 10 km stretch of track.

Parameter: \(\lambda =\) mean failures per month
Likelihood: \(Y_i \mid \lambda \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)\)
Prior: Gamma(2, 0.5) — failures are relatively rare, modest prior certainty (mean = 4)

Code

library(bayesrules)

# Observed 4 months: y = (2, 0, 3, 1)
y <- c(2, 0, 3, 1)

# Posterior: Gamma(s + sum(y), r + n) = Gamma(8, 4.5)
s_post <- 2 + sum(y)
r_post <- 0.5 + length(y)
cat("Posterior: Gamma(", s_post, ",", r_post, ")\n")

Posterior: Gamma( 8 , 4.5 )

Code

plot_gamma_poisson(shape = 2, rate = 0.5, sum_y = sum(y), n = length(y))

Code

summarize_gamma_poisson(shape = 2, rate = 0.5, sum_y = sum(y), n = length(y))

So far so familiar — the Gamma–Poisson is conjugate, and the posterior is a clean Gamma(8, 4.5). But what if the engineer’s prior beliefs aren’t Gamma-shaped? Maybe she’s confident that \(\lambda\) lies in \([1, 5]\) and is unwilling to commit to a particular shape within that range. Or maybe she has two competing hypotheses about the maintenance regime that suggest a bimodal prior. The conjugacy trick breaks — but the mathematics

\[f(\lambda \mid y) \propto f(\lambda) \cdot L(y \mid \lambda)\]

still holds. We just can’t read off the answer in closed form. That’s the puzzle the rest of this lecture solves.

Beyond Conjugacy: Grid Approximation & MCMC

Conjugate models require prior and likelihood to match. Real research often involves:

Multiple unknown parameters
Non-standard prior shapes (truncated, mixture)
Complex hierarchical models

The good news: \(f(\theta \mid y) \propto f(\theta) \cdot L(y \mid \theta)\) always holds. We just need numerical methods to compute usable summaries.

Grid Approximation

Define a grid: \(\theta_1, \ldots, \theta_K\)
Evaluate unnormalised posterior at each point: \(u_i = f(\theta_i) \cdot L(y \mid \theta_i)\)
Normalise: \(\hat{f}(\theta_i \mid y) = u_i / \sum_j u_j\)
Sample from the grid with probabilities proportional to \(\hat{f}\)

Code

# Non-conjugate example: Beta(3,3) prior truncated to [0.4, 1]
theta_grid <- seq(0, 1, length.out = 2000)

prior      <- dbeta(theta_grid, 3, 3) * (theta_grid >= 0.4)
likelihood <- dbinom(18, size = 25, prob = theta_grid)

unnorm         <- prior * likelihood
posterior_grid <- unnorm / sum(unnorm)

theta_samples <- sample(theta_grid, size = 1e4,
                        replace = TRUE, prob = posterior_grid)

hist(theta_samples, breaks = 50, probability = TRUE,
     main = "Grid approx: truncated Beta(3,3) prior, y=18, n=25",
     xlab = expression(theta), col = "steelblue", border = "white")

Code

cat("Posterior mean:", round(mean(theta_samples), 3), "\n")

Posterior mean: 0.679

Code

cat("95% ETI: [", round(quantile(theta_samples, 0.025), 3),
    ",", round(quantile(theta_samples, 0.975), 3), "]\n")

95% ETI: [ 0.509 , 0.825 ]

Limitation: Grid points grow as \(K^p\) with the number of parameters \(p\). With 10 parameters and 100 grid points each, you need \(100^{10} = 10^{20}\) evaluations — infeasible.

MCMC — The Intuition

Markov Chain Monte Carlo draws samples from the posterior without computing the normalising constant.

The island-hopping metaphor (Metropolis algorithm)

A politician visits islands whose populations represent posterior density:

Propose a neighbouring island at random
If it’s bigger → always move
If it’s smaller → move with probability = (proposed / current population)
Otherwise → stay put

Long-run time spent on each island \(\propto\) population — without ever needing the total.

The politician and the unknown total

The metaphor maps exactly onto the posterior:

Island world	Posterior
an island	a parameter value \(\lambda\)
island population	plausibility \(f(\lambda)\,L(\lambda \mid y)\) (prior × likelihood)
total population of all islands	normalising constant \(f(y)\)

The politician can only count the people on his current island and on the neighbour he proposes. He never takes a world census — that would mean visiting every island and summing them, i.e. computing \(f(y) = \int f(\lambda)\,L(\lambda \mid y)\,d\lambda\), the intractable integral.

And he doesn’t need it. Standing on an island of 1,000, proposing one of 600:

\[\alpha = \frac{600}{1000} = 0.6 .\]

The “true” population shares are \(600/T\) and \(1000/T\) (with \(T\) the world total), but their ratio is \(\tfrac{600/T}{1000/T} = 0.6\) — the unknown total \(T\) cancels. Two local head-counts are all he needs.

Why he still visits the small islands

His rule: a bigger neighbour → always go; a smaller neighbour → go with probability equal to the size ratio.

Why not just “always move to the bigger island”? Then he would climb to the single largest island and never leave — reporting only the posterior’s peak (the mode) and missing its entire spread.

Accepting smaller islands in proportion to their size is what fixes this:

big islands → visited often
a neighbour 60% as big → visited about 60% as often
tiny islands → rarely, but not never

The payoff: over a long tour, the time spent on each island is proportional to its population — i.e. to the posterior plausibility. So the histogram of where he has been reconstructs the whole posterior, in correct proportions, without his ever knowing the total.

The acceptance rule

The metaphor becomes exact through the acceptance probability. Standing at the current value \(\mu\), we propose a new value \(\mu'\) from a (symmetric) proposal model and then accept the move with probability

\[\alpha = \min\!\left\{1,\; \frac{f(\mu')\,L(\mu' \mid y)}{f(\mu)\,L(\mu \mid y)}\right\}.\]

Here is the crucial trick. Divide the numerator and denominator by the intractable normalising constant \(f(y)\) — it cancels:

\[\alpha = \min\!\left\{1,\; \frac{f(\mu' \mid y)}{f(\mu \mid y)}\right\}.\]

So the accept/reject decision depends only on the ratio of (unnormalised) posterior plausibilities, which is exactly what we can evaluate as \(f(\mu)\,L(\mu \mid y)\). This is the whole reason MCMC escapes the normalising-constant problem that broke conjugacy.

Two scenarios — the “bigger / smaller island” rules made precise

Proposal at least as plausible (\(f(\mu' \mid y) \ge f(\mu \mid y)\)): \(\alpha = 1\), so we always move.
Proposal less plausible (\(f(\mu' \mid y) < f(\mu \mid y)\)): move with probability \(\alpha = f(\mu' \mid y) / f(\mu \mid y) < 1\). Bolder downhill jumps are accepted less often, so the chain still visits the tails occasionally but spends most of its time where the posterior mass is.

With a symmetric proposal — like the Uniform \(\mu' \mid \mu \sim \text{Unif}(\mu - w, \mu + w)\) used in Exercises 7.4 and 7.9 — the proposal terms cancel and we get exactly the Metropolis rule above. Asymmetric proposals carry an extra factor \(q(\mu \mid \mu')/q(\mu' \mid \mu)\) (the full Metropolis–Hastings rule).

The Algorithm, Step by Step (Dutch Railways)

Let’s run the algorithm on our signal-failure data \(y = (2, 0, 3, 1)\) with the Gamma(2, 0.5) prior. We already know the posterior is Gamma(8, 4.5), so we can hold the sampler up against the truth. The only ingredient we need is the unnormalised log-posterior, \(\log\!\big[f(\lambda)\,L(\lambda \mid y)\big]\):

Code

y <- c(2, 0, 3, 1)

# Unnormalised log-posterior for the NS rate lambda
# (prior Gamma(2, 0.5)  x  Poisson likelihood)
log_post_ns <- function(lambda) {
  if (lambda <= 0) return(-Inf)                      # lambda must be positive
  dgamma(lambda, shape = 2, rate = 0.5, log = TRUE) +
    sum(dpois(y, lambda, log = TRUE))
}

One iteration

Suppose the chain currently sits at \(\lambda = 2\). One Metropolis step is exactly the three moves from the acceptance rule: propose, score, decide.

Code

set.seed(13)
current <- 2        # current chain location
width   <- 0.8      # proposal SD (symmetric Normal proposal -> Metropolis)

# STEP 1 — propose a nearby value
proposal <- rnorm(1, mean = current, sd = width)

# STEP 2 — acceptance probability = min(1, ratio of unnormalised posteriors)
alpha <- min(1, exp(log_post_ns(proposal) - log_post_ns(current)))

# STEP 3 — flip a coin: move with probability alpha, otherwise stay put
u        <- runif(1)
next_val <- if (u < alpha) proposal else current

cat("current    :", round(current, 3), "\n")

current    : 2

Code

cat("proposal   :", round(proposal, 3), "\n")

proposal   : 2.443

Code

cat("alpha      :", round(alpha, 3), "\n")

alpha      : 0.552

Code

cat("coin u     :", round(u, 3), if (u < alpha) " -> ACCEPT" else " -> reject", "\n")

coin u     : 0.39  -> ACCEPT

Code

cat("next value :", round(next_val, 3), "\n")

next value : 2.443

In this draw the proposal is a little less plausible than the current value, so \(\alpha < 1\) — we only might move (Scenario 2 from the acceptance rule). The coin flip happens to fall below \(\alpha\), so we accept. Had it landed above \(\alpha\), the chain would have stayed at \(\lambda = 2\) and recorded that value again.

…and keep going

A full sampler just repeats that step a few thousand times, each iteration starting from wherever the previous one landed:

Code

run_mh <- function(width, start, n_iter) {
  lam <- numeric(n_iter)
  lam[1] <- start
  accepts <- 0
  for (t in 2:n_iter) {
    proposal <- rnorm(1, mean = lam[t - 1], sd = width)
    if (log(runif(1)) < log_post_ns(proposal) - log_post_ns(lam[t - 1])) {
      lam[t] <- proposal; accepts <- accepts + 1
    } else {
      lam[t] <- lam[t - 1]                            # rejected: stay put
    }
  }
  attr(lam, "accept_rate") <- accepts / (n_iter - 1)
  lam
}

set.seed(13)
chain <- run_mh(width = 0.8, start = 2, n_iter = 2000)
round(head(chain, 10), 3)        # the chain's first ten stops

 [1] 2.000 2.443 1.378 1.527 1.527 1.527 1.250 2.134 2.229 1.140

See how some values repeat? Those are the iterations where the proposal was rejected and the chain stayed where it was.

A reusable trace-plot tool

To explore how the sampler behaves under different settings, here is a small helper. Vary the proposal width, the starting value(s), the number of chains, and the chain length, and it returns the trace plot (the dashed line marks the true posterior mean, \(8/4.5 \approx 1.78\)):

Code

trace_demo <- function(width = 0.8, starts = 2, n_chains = 1,
                       n_iter = 2000, seed = 1, title = NULL) {
  set.seed(seed)
  starts <- rep_len(starts, n_chains)                 # one start per chain
  chains <- lapply(starts, function(s) run_mh(width = width, start = s, n_iter = n_iter))
  acc    <- mean(sapply(chains, attr, "accept_rate"))

  df <- data.frame(
    iteration = rep(seq_len(n_iter), times = n_chains),
    lambda    = unlist(chains),
    chain     = factor(rep(seq_len(n_chains), each = n_iter))
  )
  if (is.null(title))
    title <- sprintf("width = %g | start(s) = %s | %d chain%s | accept = %.0f%%",
                     width, paste(round(starts, 2), collapse = ", "),
                     n_chains, if (n_chains > 1) "s" else "", 100 * acc)

  ggplot2::ggplot(df, ggplot2::aes(iteration, lambda, colour = chain)) +
    ggplot2::geom_line(linewidth = 0.3, alpha = 0.85) +
    ggplot2::geom_hline(yintercept = 8 / 4.5, linetype = "dashed") +
    ggplot2::labs(title = title, x = "iteration", y = expression(lambda)) +
    ggplot2::theme_minimal(base_size = 12) +
    ggplot2::theme(legend.position = if (n_chains > 1) "right" else "none",
          plot.title = ggplot2::element_text(size = 10))
}

# A healthy run
trace_demo(width = 0.8, starts = 2)

A well-tuned chain like this is the “hairy caterpillar”: it wanders rapidly around the dashed posterior mean with no trends and no long flat stretches.

Diagnosing the Chain

In practice you let brms (which calls Stan) run the sampler and report the diagnostics automatically — but the ideas are identical whether the draws come from Stan or from the by-hand Metropolis sampler we build below. A Markov chain only approximates the posterior, so we must check that the approximation is trustworthy.

Diagnostic	What it checks	Healthy sign
Trace plot	does a single chain explore stably?	“hairy caterpillar”: white noise, no trends or flat runs
Burn-in / warm-up	early samples while the chain finds the posterior	discard them before summarising
Parallel chains	do independent chains agree?	several chains from different starts overlap
Effective sample size \(N_{\text{eff}}\)	how many independent draws the dependent draws are worth	ratio \(N_{\text{eff}}/N\) not too small (suspect if \(< 0.1\))
Autocorrelation	how fast dependence between draws fades	drops to ~0 within a few lags
\(\hat R\) (R-hat)	between-chain vs. within-chain variability	\(\approx 1\); a value \(> 1.05\) is a red flag

Visual diagnostics: trace plots

Now we can see the pathologies from Chapter 6.3 just by handing different settings to trace_demo(). A good trace plot looks like featureless white noise around the posterior mean; the troublesome ones drift slowly or stick.

Code

library(gridExtra)
grid.arrange(
  trace_demo(width = 0.8,  starts = 2,            n_chains = 1,
             title = "(a) Healthy: width = 0.8"),
  trace_demo(width = 0.02, starts = 2,            n_chains = 1,
             title = "(b) Width too small (0.02): slow mixing"),
  trace_demo(width = 20,   starts = 2,            n_chains = 1,
             title = "(c) Width too large (20): gets stuck"),
  trace_demo(width = 0.8,  starts = c(0.3, 4, 9), n_chains = 3,
             title = "(d) Three chains from different starts"),
  ncol = 2
)

(a) mixes well — rapid exploration around the posterior mean (~61% of proposals accepted).
(b) a too-small width takes tiny steps: nearly every proposal is accepted (~99%), but the chain only inches along and is strongly autocorrelated — it has barely explored after 2,000 iterations.
(c) a too-large width proposes wild values (often negative, always rejected), so only ~4% are accepted and the chain sticks for long stretches — the flat horizontal runs.
(d) three chains launched from \(\lambda \in \{0.3, 4, 9\}\) all migrate to the same band within a few dozen iterations (the burn-in) and then agree — strong evidence of convergence.

Try it live: call trace_demo(width = ..., starts = ..., n_chains = ..., n_iter = ...) with your own settings and watch the mixing — and the acceptance rate in each title — respond.

Numerical diagnostics: effective sample size & R-hat

The numbers back up the picture. We run four longer parallel chains with run_mh() and hand them to coda:

Code

set.seed(2026)
raw <- lapply(c(0.5, 2, 5, 8), function(s) run_mh(width = 0.8, start = s, n_iter = 6000))
mc  <- mcmc.list(lapply(raw, function(ch) mcmc(ch[1001:6000])))   # drop 1,000 burn-in

N    <- 4 * 5000                       # total retained draws
ess  <- sum(effectiveSize(mc))         # effective sample size
rhat <- gelman.diag(mc)$psrf[1, 1]     # split-chain R-hat

cat("Total retained draws   :", N, "\n")

Total retained draws   : 20000

Code

cat("Effective sample size  :", round(ess), "\n")

Effective sample size  : 2830

Code

cat("Eff. sample-size ratio :", round(ess / N, 3), "  (suspect if < 0.1)\n")

Eff. sample-size ratio : 0.142   (suspect if < 0.1)

Code

cat("R-hat                  :", round(rhat, 4), "  (red flag if > 1.05)\n")

R-hat                  : 1.0007   (red flag if > 1.05)

Code

cat("Lag-1 autocorrelation  :", round(acf(raw[[1]][1001:6000], plot = FALSE)$acf[2], 3), "\n")

Lag-1 autocorrelation  : 0.728

Here the ratio sits above 0.1 and \(\hat R \approx 1\), so we trust the approximation. The moderate lag-1 autocorrelation is expected for a random-walk Metropolis sampler — exactly what a better-tuned proposal (Exercise 7.4) or a smarter sampler such as Stan’s would reduce, raising the effective sample size.

The autocorrelation plot

Bayes Rules! inspects chains with the bayesplot package — mcmc_trace() for trace plots, mcmc_dens_overlay() for the parallel-chain densities, and mcmc_acf() for the autocorrelation plot (the book’s Figures 6.15–6.16). The autocorrelation plot shows the correlation between chain values that are 1, 2, 3, … steps apart: in a healthy chain it drops to ~0 within a few lags; in a slow-mixing chain it lingers near 1. The same functions work on our hand-rolled draws, not just Stan output:

Code

library(bayesplot)

# One chain at each setting, packaged as a single-column "draws" matrix
acf_mat <- function(width, seed = 2026) {
  set.seed(seed)
  draws <- run_mh(width = width, start = 2, n_iter = 6000)[1001:6000]
  matrix(draws, ncol = 1, dimnames = list(NULL, "lambda"))
}

grid.arrange(
  mcmc_acf(acf_mat(0.8))  + ggtitle("Well-mixed (width = 0.8): decays fast"),
  mcmc_acf(acf_mat(0.05)) + ggtitle("Slow-mixing (width = 0.05): lingers near 1"),
  ncol = 2
)

On the left the autocorrelation falls away within a handful of lags — the draws quickly become “effectively independent,” giving a high effective sample size. On the right the tiny proposal width keeps consecutive draws almost identical, so the autocorrelation stays near 1 for many lags: lots of iterations, very little independent information. This is the visual signature behind a low \(N_{\text{eff}}/N\) ratio.

Full `bayesplot` parity: trace & density plots

mcmc_acf() above already accepted our hand-rolled chains; the book’s other two workhorses — mcmc_trace() and mcmc_dens_overlay() — take the same input. This one-liner packs any set of run_mh() chains into the iterations × chains × parameter array that bayesplot expects:

Code

# Convert run_mh() chains into the array bayesplot wants
as_draws <- function(width, starts, n_iter = 6000, burnin = 1000, seed = 2026) {
  set.seed(seed)
  chains <- lapply(starts, function(s) run_mh(width, s, n_iter)[(burnin + 1):n_iter])
  array(unlist(chains), dim = c(n_iter - burnin, length(starts), 1),
        dimnames = list(NULL, paste0("chain", seq_along(starts)), "lambda"))
}

draws <- as_draws(width = 0.8, starts = c(0.5, 2, 5, 8))

mcmc_trace(draws, pars = "lambda")          # overlaid trace plots (book Fig 6.8)

Code

mcmc_dens_overlay(draws, pars = "lambda")   # per-chain density overlay (book Fig 6.13)

These are exactly the diagnostics from Bayes Rules!: overlaid traces that should look like uniform noise, and per-chain densities that should sit on top of one another. Swap in any width/starts to contrast healthy vs. problematic runs live — e.g. as_draws(width = 0.05, starts = c(0.5, 2, 5, 8)) gives chains whose densities visibly disagree.

Using the Posterior Samples

Once you have MCMC draws, all posterior summaries work exactly the same as with conjugate samples:

Code

# theta_mcmc: any posterior sample (grid, MCMC, conjugate — identical workflow)
theta_mcmc <- theta_samples

mean(theta_mcmc)                        # posterior mean

[1] 0.6786158

Code

median(theta_mcmc)                      # posterior median

[1] 0.6833417

Code

mean(theta_mcmc > 0.6)                  # P(theta > 0.6 | y)

[1] 0.8293

Code

quantile(theta_mcmc, c(0.025, 0.975))   # 95% ETI

     2.5%     97.5% 
0.5092546 0.8254127

In-Class Exercises

Exercise 1: Grid Approximation with a Truncated Prior

A researcher believes the placebo effect rate is between 30% and 90%, and is otherwise uncertain. Prior: Beta(2,2) truncated to \([0.30, 0.90]\). Observed: 18 of 25 patients improve.

Implement the grid approximation.
Compute the posterior mean and 95% ETI.
Compare to an untruncated Beta(2,2) — how much does the truncation matter here?

Code

theta_grid <- seq(0, 1, length.out = 5000)
y_obs <- 18; n_obs <- 25

# Truncated prior
prior_trunc <- dbeta(theta_grid, 2, 2) *
               (theta_grid >= 0.3 & theta_grid <= 0.9)
lik         <- dbinom(y_obs, n_obs, prob = theta_grid)
post_trunc  <- (prior_trunc * lik) / sum(prior_trunc * lik)
samp_trunc  <- sample(theta_grid, 1e4, replace = TRUE, prob = post_trunc)

# Untruncated
prior_full <- dbeta(theta_grid, 2, 2)
post_full  <- (prior_full * lik) / sum(prior_full * lik)
samp_full  <- sample(theta_grid, 1e4, replace = TRUE, prob = post_full)

cat("Truncated: mean =", round(mean(samp_trunc), 3),
    " 95% ETI = [", round(quantile(samp_trunc, 0.025), 3),
    ",",             round(quantile(samp_trunc, 0.975), 3), "]\n")

Truncated: mean = 0.69  95% ETI = [ 0.516 , 0.839 ]

Code

cat("Full:      mean =", round(mean(samp_full),  3),
    " 95% ETI = [", round(quantile(samp_full,  0.025), 3),
    ",",             round(quantile(samp_full,  0.975), 3), "]\n")

Full:      mean = 0.689  95% ETI = [ 0.514 , 0.839 ]

Code

df_plot <- data.frame(
  theta = c(samp_trunc, samp_full),
  Prior = rep(c("Truncated Beta(2,2)", "Full Beta(2,2)"), each = 1e4)
)
ggplot(df_plot, aes(x = theta, fill = Prior)) +
  geom_density(alpha = 0.5) +
  labs(title = "Posterior: truncated vs full prior",
       x = expression(theta), y = "Density") +
  theme_minimal(base_size = 13) +
  theme(legend.title = element_blank())

Exercise 2: Metropolis by hand (Dutch Railways)

Back to NS signal failures with prior Gamma(2, 0.5) and data \(y = (2, 0, 3, 1)\). Pretend you’ve forgotten that the conjugate posterior is Gamma(8, 4.5). Build a tiny Metropolis sampler instead and compare.

Implement a Metropolis sampler for \(\lambda > 0\) with a Normal proposal centred at the current value (e.g. SD = 0.5). Use the unnormalised posterior \(f(\lambda) \cdot L(y \mid \lambda)\) in the acceptance ratio.
Run 5,000 iterations, discard the first 1,000 as burn-in.
Compare your sampled mean and 95% ETI to the analytic posterior Gamma(8, 4.5).
Produce a trace plot. Does the chain look like a “hairy caterpillar”?

Code

set.seed(2026)

y <- c(2, 0, 3, 1)

log_post <- function(lambda) {
  if (lambda <= 0) return(-Inf)
  dgamma(lambda, shape = 2, rate = 0.5, log = TRUE) +
    sum(dpois(y, lambda, log = TRUE))
}

n_iter   <- 5000
lambda   <- numeric(n_iter)
lambda[1] <- 1            # starting value
accepts  <- 0

for (t in 2:n_iter) {
  proposal <- rnorm(1, mean = lambda[t - 1], sd = 0.5)
  log_a <- log_post(proposal) - log_post(lambda[t - 1])
  if (log(runif(1)) < log_a) {
    lambda[t] <- proposal
    accepts   <- accepts + 1
  } else {
    lambda[t] <- lambda[t - 1]
  }
}

cat("Acceptance rate:", round(accepts / (n_iter - 1), 3), "\n")

Acceptance rate: 0.742

Code

post <- lambda[1001:n_iter]                       # burn-in
cat("MCMC mean :", round(mean(post), 3), "\n")

MCMC mean : 1.77

Code

cat("Analytic  :", round(8 / 4.5, 3), "\n")

Analytic  : 1.778

Code

cat("MCMC 95% ETI:", round(quantile(post, c(.025, .975)), 3), "\n")

MCMC 95% ETI: 0.756 3.145

Code

cat("Analytic   :", round(qgamma(c(.025, .975), 8, 4.5), 3), "\n")

Analytic   : 0.768 3.205

Code

par(mfrow = c(1, 2))
plot(lambda, type = "l", col = "steelblue",
     xlab = "iteration", ylab = expression(lambda),
     main = "Trace plot")
abline(v = 1000, col = "red", lty = 2)
hist(post, breaks = 40, probability = TRUE, col = "steelblue",
     border = "white", main = "Posterior (after burn-in)",
     xlab = expression(lambda))
curve(dgamma(x, 8, 4.5), add = TRUE, col = "darkred", lwd = 2)

Code

par(mfrow = c(1, 1))

The MCMC mean and ETI should sit very close to the analytic Gamma(8, 4.5) values — same posterior, two different routes.

Next lecture (June 15): Lecture 5 puts the posterior to work: point estimates, credible intervals, posterior prediction, prior + posterior predictive checks, and Bayes factors for hypothesis testing (Ch. 8 + Etz et al. 2018).

Review: Four Ingredients of a Bayesian Analysis

Running Example: Dutch Railways

Beyond Conjugacy: Grid Approximation & MCMC

Grid Approximation

MCMC — The Intuition

The politician and the unknown total

Why he still visits the small islands

The acceptance rule

The Algorithm, Step by Step (Dutch Railways)

One iteration

…and keep going

A reusable trace-plot tool

Diagnosing the Chain

Visual diagnostics: trace plots

Numerical diagnostics: effective sample size & R-hat

The autocorrelation plot

Full bayesplot parity: trace & density plots

Using the Posterior Samples

In-Class Exercises

Exercise 1: Grid Approximation with a Truncated Prior

Exercise 2: Metropolis by hand (Dutch Railways)

Full `bayesplot` parity: trace & density plots