Use posterior samples — regardless of how they were obtained — to compute summaries
Recommended Reading:Bayes Rules! Chapters 6 (approximating the posterior) and 7 (MCMC under the hood). Suggested exercises: Ch. 6 — 2, 3, 8; Ch. 7 — 4, 9.
On the exam you don’t need to implement samplers by hand or use Stan, but you should understand the intuition and be able to interpret diagnostics.
Review: Four Ingredients of a Bayesian Analysis
Every Bayesian model is built on the same four components, regardless of model complexity.
The four ingredients
Prior model — beliefs about parameter \(\theta\) before seeing data: \(f(\theta)\)
Data model (likelihood) — how data arise given the parameter: \(f(y \mid \theta)\)
Likelihood function — the same expression viewed as a function of \(\theta\) with \(y\) fixed: \(L(\theta \mid y) \propto f(y \mid \theta)\)
Posterior model — updated beliefs after seeing data: \(f(\theta \mid y) \propto f(\theta) \cdot L(\theta \mid y)\)
The three conjugate families from last lecture:
Likelihood
Prior
Posterior
Binomial\((n, \theta)\)
Beta\((\alpha, \beta)\)
Beta\((\alpha+y,\;\beta+n-y)\)
Poisson\((\lambda)\)
Gamma\((s, r)\)
Gamma\((s+\sum y_i,\; r+n)\)
Normal\((\mu, \sigma^2)\), \(\sigma\) known
Normal\((\mu_0, \tau^2)\)
Normal\((\mu_\text{post}, \sigma^2_\text{post})\)
Today we ask: what happens when the model is not conjugate, or when the posterior simply has no closed form? We turn to numerical approximations — grid approximation and MCMC.
Running Example: Dutch Railways
We will use one running example throughout this lecture, starting with a conjugate set-up and then breaking it.
Scenario
You are a data analyst for the Dutch Railway Service (NS). Your task: model the monthly rate of signal failures on a 10 km stretch of track.
plot_gamma_poisson(shape =2, rate =0.5, sum_y =sum(y), n =length(y))
Code
summarize_gamma_poisson(shape =2, rate =0.5, sum_y =sum(y), n =length(y))
So far so familiar — the Gamma–Poisson is conjugate, and the posterior is a clean Gamma(8, 4.5). But what if the engineer’s prior beliefs aren’t Gamma-shaped? Maybe she’s confident that \(\lambda\) lies in \([1, 5]\) and is unwilling to commit to a particular shape within that range. Or maybe she has two competing hypotheses about the maintenance regime that suggest a bimodal prior. The conjugacy trick breaks — but the mathematics
Limitation: Grid points grow as \(K^p\) with the number of parameters \(p\). With 10 parameters and 100 grid points each, you need \(100^{10} = 10^{20}\) evaluations — infeasible.
MCMC — The Intuition
Markov Chain Monte Carlo draws samples from the posterior without computing the normalising constant.
The island-hopping metaphor (Metropolis algorithm)
A politician visits islands whose populations represent posterior density:
Propose a neighbouring island at random
If it’s bigger → always move
If it’s smaller → move with probability = (proposed / current population)
Otherwise → stay put
Long-run time spent on each island \(\propto\) population — without ever needing the total.
The politician can only count the people on his current island and on the neighbour he proposes. He never takes a world census — that would mean visiting every island and summing them, i.e. computing \(f(y) = \int f(\lambda)\,L(\lambda \mid y)\,d\lambda\), the intractable integral.
And he doesn’t need it. Standing on an island of 1,000, proposing one of 600:
\[\alpha = \frac{600}{1000} = 0.6 .\]
The “true” population shares are \(600/T\) and \(1000/T\) (with \(T\) the world total), but their ratio is \(\tfrac{600/T}{1000/T} = 0.6\) — the unknown total \(T\)cancels. Two local head-counts are all he needs.
Why he still visits the small islands
His rule: a bigger neighbour → always go; a smaller neighbour → go with probability equal to the size ratio.
Why not just “always move to the bigger island”? Then he would climb to the single largest island and never leave — reporting only the posterior’s peak (the mode) and missing its entire spread.
Accepting smaller islands in proportion to their size is what fixes this:
big islands → visited often
a neighbour 60% as big → visited about 60% as often
tiny islands → rarely, but not never
The payoff: over a long tour, the time spent on each island is proportional to its population — i.e. to the posterior plausibility. So the histogram of where he has been reconstructs the whole posterior, in correct proportions, without his ever knowing the total.
The acceptance rule
The metaphor becomes exact through the acceptance probability. Standing at the current value \(\mu\), we propose a new value \(\mu'\) from a (symmetric) proposal model and then accept the move with probability
So the accept/reject decision depends only on the ratio of (unnormalised) posterior plausibilities, which is exactly what we can evaluate as \(f(\mu)\,L(\mu \mid y)\). This is the whole reason MCMC escapes the normalising-constant problem that broke conjugacy.
Two scenarios — the “bigger / smaller island” rules made precise
Proposal at least as plausible (\(f(\mu' \mid y) \ge f(\mu \mid y)\)): \(\alpha = 1\), so we always move.
Proposal less plausible (\(f(\mu' \mid y) < f(\mu \mid y)\)): move with probability \(\alpha = f(\mu' \mid y) / f(\mu \mid y) < 1\). Bolder downhill jumps are accepted less often, so the chain still visits the tails occasionally but spends most of its time where the posterior mass is.
With a symmetric proposal — like the Uniform \(\mu' \mid \mu \sim \text{Unif}(\mu - w, \mu + w)\) used in Exercises 7.4 and 7.9 — the proposal terms cancel and we get exactly the Metropolis rule above. Asymmetric proposals carry an extra factor \(q(\mu \mid \mu')/q(\mu' \mid \mu)\) (the full Metropolis–Hastings rule).
The Algorithm, Step by Step (Dutch Railways)
Let’s run the algorithm on our signal-failure data \(y = (2, 0, 3, 1)\) with the Gamma(2, 0.5) prior. We already know the posterior is Gamma(8, 4.5), so we can hold the sampler up against the truth. The only ingredient we need is the unnormalised log-posterior, \(\log\!\big[f(\lambda)\,L(\lambda \mid y)\big]\):
Code
y <-c(2, 0, 3, 1)# Unnormalised log-posterior for the NS rate lambda# (prior Gamma(2, 0.5) x Poisson likelihood)log_post_ns <-function(lambda) {if (lambda <=0) return(-Inf) # lambda must be positivedgamma(lambda, shape =2, rate =0.5, log =TRUE) +sum(dpois(y, lambda, log =TRUE))}
One iteration
Suppose the chain currently sits at \(\lambda = 2\). One Metropolis step is exactly the three moves from the acceptance rule: propose, score, decide.
Code
set.seed(13)current <-2# current chain locationwidth <-0.8# proposal SD (symmetric Normal proposal -> Metropolis)# STEP 1 — propose a nearby valueproposal <-rnorm(1, mean = current, sd = width)# STEP 2 — acceptance probability = min(1, ratio of unnormalised posteriors)alpha <-min(1, exp(log_post_ns(proposal) -log_post_ns(current)))# STEP 3 — flip a coin: move with probability alpha, otherwise stay putu <-runif(1)next_val <-if (u < alpha) proposal else currentcat("current :", round(current, 3), "\n")
current : 2
Code
cat("proposal :", round(proposal, 3), "\n")
proposal : 2.443
Code
cat("alpha :", round(alpha, 3), "\n")
alpha : 0.552
Code
cat("coin u :", round(u, 3), if (u < alpha) " -> ACCEPT"else" -> reject", "\n")
coin u : 0.39 -> ACCEPT
Code
cat("next value :", round(next_val, 3), "\n")
next value : 2.443
In this draw the proposal is a little less plausible than the current value, so \(\alpha < 1\) — we only might move (Scenario 2 from the acceptance rule). The coin flip happens to fall below \(\alpha\), so we accept. Had it landed above \(\alpha\), the chain would have stayed at \(\lambda = 2\) and recorded that value again.
…and keep going
A full sampler just repeats that step a few thousand times, each iteration starting from wherever the previous one landed:
See how some values repeat? Those are the iterations where the proposal was rejected and the chain stayed where it was.
A reusable trace-plot tool
To explore how the sampler behaves under different settings, here is a small helper. Vary the proposal width, the starting value(s), the number of chains, and the chain length, and it returns the trace plot (the dashed line marks the true posterior mean, \(8/4.5 \approx 1.78\)):
A well-tuned chain like this is the “hairy caterpillar”: it wanders rapidly around the dashed posterior mean with no trends and no long flat stretches.
Diagnosing the Chain
In practice you let brms (which calls Stan) run the sampler and report the diagnostics automatically — but the ideas are identical whether the draws come from Stan or from the by-hand Metropolis sampler we build below. A Markov chain only approximates the posterior, so we must check that the approximation is trustworthy.
Diagnostic
What it checks
Healthy sign
Trace plot
does a single chain explore stably?
“hairy caterpillar”: white noise, no trends or flat runs
Burn-in / warm-up
early samples while the chain finds the posterior
discard them before summarising
Parallel chains
do independent chains agree?
several chains from different starts overlap
Effective sample size\(N_{\text{eff}}\)
how many independent draws the dependent draws are worth
ratio \(N_{\text{eff}}/N\) not too small (suspect if \(< 0.1\))
Autocorrelation
how fast dependence between draws fades
drops to ~0 within a few lags
\(\hat R\) (R-hat)
between-chain vs. within-chain variability
\(\approx 1\); a value \(> 1.05\) is a red flag
Visual diagnostics: trace plots
Now we can see the pathologies from Chapter 6.3 just by handing different settings to trace_demo(). A good trace plot looks like featureless white noise around the posterior mean; the troublesome ones drift slowly or stick.
Code
library(gridExtra)grid.arrange(trace_demo(width =0.8, starts =2, n_chains =1,title ="(a) Healthy: width = 0.8"),trace_demo(width =0.02, starts =2, n_chains =1,title ="(b) Width too small (0.02): slow mixing"),trace_demo(width =20, starts =2, n_chains =1,title ="(c) Width too large (20): gets stuck"),trace_demo(width =0.8, starts =c(0.3, 4, 9), n_chains =3,title ="(d) Three chains from different starts"),ncol =2)
(a) mixes well — rapid exploration around the posterior mean (~61% of proposals accepted).
(b) a too-small width takes tiny steps: nearly every proposal is accepted (~99%), but the chain only inches along and is strongly autocorrelated — it has barely explored after 2,000 iterations.
(c) a too-large width proposes wild values (often negative, always rejected), so only ~4% are accepted and the chain sticks for long stretches — the flat horizontal runs.
(d) three chains launched from \(\lambda \in \{0.3, 4, 9\}\) all migrate to the same band within a few dozen iterations (the burn-in) and then agree — strong evidence of convergence.
Try it live: call trace_demo(width = ..., starts = ..., n_chains = ..., n_iter = ...) with your own settings and watch the mixing — and the acceptance rate in each title — respond.
Here the ratio sits above 0.1 and \(\hat R \approx 1\), so we trust the approximation. The moderate lag-1 autocorrelation is expected for a random-walk Metropolis sampler — exactly what a better-tuned proposal (Exercise 7.4) or a smarter sampler such as Stan’s would reduce, raising the effective sample size.
The autocorrelation plot
Bayes Rules! inspects chains with the bayesplot package — mcmc_trace() for trace plots, mcmc_dens_overlay() for the parallel-chain densities, and mcmc_acf() for the autocorrelation plot (the book’s Figures 6.15–6.16). The autocorrelation plot shows the correlation between chain values that are 1, 2, 3, … steps apart: in a healthy chain it drops to ~0 within a few lags; in a slow-mixing chain it lingers near 1. The same functions work on our hand-rolled draws, not just Stan output:
Code
library(bayesplot)# One chain at each setting, packaged as a single-column "draws" matrixacf_mat <-function(width, seed =2026) {set.seed(seed) draws <-run_mh(width = width, start =2, n_iter =6000)[1001:6000]matrix(draws, ncol =1, dimnames =list(NULL, "lambda"))}grid.arrange(mcmc_acf(acf_mat(0.8)) +ggtitle("Well-mixed (width = 0.8): decays fast"),mcmc_acf(acf_mat(0.05)) +ggtitle("Slow-mixing (width = 0.05): lingers near 1"),ncol =2)
On the left the autocorrelation falls away within a handful of lags — the draws quickly become “effectively independent,” giving a high effective sample size. On the right the tiny proposal width keeps consecutive draws almost identical, so the autocorrelation stays near 1 for many lags: lots of iterations, very little independent information. This is the visual signature behind a low \(N_{\text{eff}}/N\) ratio.
Full bayesplot parity: trace & density plots
mcmc_acf() above already accepted our hand-rolled chains; the book’s other two workhorses — mcmc_trace() and mcmc_dens_overlay() — take the same input. This one-liner packs any set of run_mh() chains into the iterations × chains × parameter array that bayesplot expects:
mcmc_dens_overlay(draws, pars ="lambda") # per-chain density overlay (book Fig 6.13)
These are exactly the diagnostics from Bayes Rules!: overlaid traces that should look like uniform noise, and per-chain densities that should sit on top of one another. Swap in any width/starts to contrast healthy vs. problematic runs live — e.g. as_draws(width = 0.05, starts = c(0.5, 2, 5, 8)) gives chains whose densities visibly disagree.
Using the Posterior Samples
Once you have MCMC draws, all posterior summaries work exactly the same as with conjugate samples:
Code
# theta_mcmc: any posterior sample (grid, MCMC, conjugate — identical workflow)theta_mcmc <- theta_samplesmean(theta_mcmc) # posterior mean
[1] 0.6786158
Code
median(theta_mcmc) # posterior median
[1] 0.6833417
Code
mean(theta_mcmc >0.6) # P(theta > 0.6 | y)
[1] 0.8293
Code
quantile(theta_mcmc, c(0.025, 0.975)) # 95% ETI
2.5% 97.5%
0.5092546 0.8254127
In-Class Exercises
Exercise 1: Grid Approximation with a Truncated Prior
A researcher believes the placebo effect rate is between 30% and 90%, and is otherwise uncertain. Prior: Beta(2,2) truncated to \([0.30, 0.90]\). Observed: 18 of 25 patients improve.
Truncated: mean = 0.69 95% ETI = [ 0.516 , 0.839 ]
Code
cat("Full: mean =", round(mean(samp_full), 3)," 95% ETI = [", round(quantile(samp_full, 0.025), 3),",", round(quantile(samp_full, 0.975), 3), "]\n")
Full: mean = 0.689 95% ETI = [ 0.514 , 0.839 ]
Code
df_plot <-data.frame(theta =c(samp_trunc, samp_full),Prior =rep(c("Truncated Beta(2,2)", "Full Beta(2,2)"), each =1e4))ggplot(df_plot, aes(x = theta, fill = Prior)) +geom_density(alpha =0.5) +labs(title ="Posterior: truncated vs full prior",x =expression(theta), y ="Density") +theme_minimal(base_size =13) +theme(legend.title =element_blank())
Exercise 2: Metropolis by hand (Dutch Railways)
Back to NS signal failures with prior Gamma(2, 0.5) and data \(y = (2, 0, 3, 1)\). Pretend you’ve forgotten that the conjugate posterior is Gamma(8, 4.5). Build a tiny Metropolis sampler instead and compare.
Implement a Metropolis sampler for \(\lambda > 0\) with a Normal proposal centred at the current value (e.g. SD = 0.5). Use the unnormalised posterior \(f(\lambda) \cdot L(y \mid \lambda)\) in the acceptance ratio.
Run 5,000 iterations, discard the first 1,000 as burn-in.
Compare your sampled mean and 95% ETI to the analytic posterior Gamma(8, 4.5).
Produce a trace plot. Does the chain look like a “hairy caterpillar”?
par(mfrow =c(1, 2))plot(lambda, type ="l", col ="steelblue",xlab ="iteration", ylab =expression(lambda),main ="Trace plot")abline(v =1000, col ="red", lty =2)hist(post, breaks =40, probability =TRUE, col ="steelblue",border ="white", main ="Posterior (after burn-in)",xlab =expression(lambda))curve(dgamma(x, 8, 4.5), add =TRUE, col ="darkred", lwd =2)
Code
par(mfrow =c(1, 1))
The MCMC mean and ETI should sit very close to the analytic Gamma(8, 4.5) values — same posterior, two different routes.
Next lecture (June 15): Lecture 5 puts the posterior to work: point estimates, credible intervals, posterior prediction, prior + posterior predictive checks, and Bayes factors for hypothesis testing (Ch. 8 + Etz et al. 2018).