Markov chain Monte Carlo (MCMC)
Metropolis-Hastings algorithm
Gibbs sampler
Given an unnormalized probability density/mass function \(u(x)\propto \pi(x)\), estimate the high-dimensional integral
$$I=\mathbb{E}_X[f(X)]=\int f(x)\pi(x)\mathrm{d}x,\quad X\sim \pi(x).$$
We have already introduced the inverse transform algorithm, rejection sampling, and importance sampling.
But they are not "general" enough.
Markov chain Monte Carlo (MCMC) tackles this problem by constructing a Markov chain
$$X_0\rightarrow X_1\rightarrow \cdots \rightarrow X_N,$$
and approximating the integral as
$$\hat{I}=\frac{1}{N}\sum_{i=1}^{N}f(X_{i}).$$
We hope that \(\hat{I}\rightarrow I=\mathbb{E}_X[f(X)]\) almost surely as \(N\rightarrow \infty\).
Unlike previously introduced Monte Carlo methods, the Markov chain \(X_0\rightarrow\cdots\rightarrow X_N\) in general contains dependent sample points \(\{X_i\}\).
The nature of Markov chains implies that \(X_{t+1}\) should be only based on \(X_t\) (with other independent sources of randomness) and not on \(X_0,\ldots,X_{t-1}\).
Therefore, with a given starting point \(X_0\), the whole Markov chain can be specified via a conditional distribution \(\mathcal{T}(y|x)\):
$$X_1\sim \mathcal{T}(\cdot|X_0),\quad X_2\sim \mathcal{T}(\cdot|X_1),\quad\ldots$$
Such a conditional distribution is typically called a transition kernel
Denoted as \(\mathcal{T}(x\rightarrow y)\equiv \mathcal{T}(y|x)\)
Think of \(x\) as the current position/state, and \(y\) the next position/state in the Markov chain
Our goal is to find a transition kernel \(\mathcal{T}(x\rightarrow y)\) such that \(\hat{I}\rightarrow I\ \ a.s.\) holds
We first require \(\mathcal{T}(x\rightarrow y)\) to satisfy some mild regularity conditions:
Irreducible: it is possible to get to any state from any state
Aperiodic: the chain cannot get trapped in cycles
Positive recurrent: the chain revisits every neighborhood infinitely often
These conditions are rather weak and can be easily satisfied.
The "real" condition that matters is that \(\pi(x)\) is an invariant/stationary distribution of \(\mathcal{T}(x\rightarrow y)\), meaning that
$$\pi(y)=\int\pi(x)\mathcal{T}(x\rightarrow y)\mathrm{d}x.$$
The "real" condition that matters is that \(\pi(x)\) is an invariant/stationary distribution of \(\mathcal{T}(x\rightarrow y)\), meaning that
$$\pi(y)=\int\pi(x)\mathcal{T}(x\rightarrow y)\mathrm{d}x.$$
Combined with the regularity conditions, we have \(\hat{I}\rightarrow I\ \ a.s.\)
This property is called the ergodicity of Markov chains
\(X_0\) can be chosen arbitrarily
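As a concrete example of the stationarity condition (ours, for illustration): take the Gaussian autoregressive kernel \(\mathcal{T}(x\rightarrow y)=N(y;\rho x,1-\rho^{2})\) on \(\mathbb{R}\) with \(|\rho|<1\). If \(X\sim N(0,1)\) and \(Y=\rho X+\sqrt{1-\rho^{2}}\,Z\) with independent \(Z\sim N(0,1)\), then $$Y\sim N(0,\rho^{2}+(1-\rho^{2}))=N(0,1),$$ so the standard normal is a stationary distribution of this kernel.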
How to satisfy the stationary distribution condition?
We typically consider a stronger, but easier-to-verify condition called detailed balance: $$\pi(x)\mathcal{T}(x\rightarrow y)=\pi(y)\mathcal{T}(y\rightarrow x)$$
Easy to verify that it implies the stationary distribution condition (see below)
Sufficient but not necessary
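Indeed, integrating both sides of detailed balance over \(x\) gives $$\int\pi(x)\mathcal{T}(x\rightarrow y)\mathrm{d}x=\int\pi(y)\mathcal{T}(y\rightarrow x)\mathrm{d}x=\pi(y)\int\mathcal{T}(y\rightarrow x)\mathrm{d}x=\pi(y),$$ since \(\mathcal{T}(y\rightarrow\cdot)\) is a probability distribution and integrates to one.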
Metropolis-Hastings algorithm
Gibbs sampler
Metropolis-adjusted Langevin algorithm
Hamiltonian Monte Carlo
Listed as one of the "Top 10 Algorithms" of the 20th century [1]
Foundation of many other MCMC algorithms
Simple and widely used
Choose an initial value \(X_0\)
Choose an arbitrary proposal distribution \(q(y|x)\)
Given \(X_t\), propose a new point \(Y\sim q(\cdot|X_t)\)
Define $$\alpha=\min\left(1,\frac{\pi(Y)q(X_{t}|Y)}{\pi(X_{t})q(Y|X_{t})}\right)$$
With probability \(\alpha\) set \(X_{t+1}=Y\); otherwise set \(X_{t+1}=X_t\)
We do not really need to know \(\pi(x)\)
The unnormalized density \(u(x)\propto \pi(x)\) suffices, since $$\alpha=\min\left(1,\frac{\pi(Y)q(X_{t}|Y)}{\pi(X_{t})q(Y|X_{t})}\right)=\min\left(1,\frac{u(Y)q(X_{t}|Y)}{u(X_{t})q(Y|X_{t})}\right)$$
\(q(y|x)\) can be chosen almost completely freely (though some choices work better than others)
These properties show the real power of M-H (a concrete sketch follows)
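To make the recipe above concrete, here is a minimal Python sketch (ours, not part of the original algorithm statement); `log_u`, `propose`, and `log_q` are user-supplied callables, and the acceptance test is done on the log scale for numerical stability:

```python
import numpy as np

def metropolis_hastings(log_u, propose, log_q, x0, n_iter, rng=None):
    """Generic Metropolis-Hastings sampler.

    log_u   : log of the unnormalized target density u(x)
    propose : (rng, x) -> a proposal Y ~ q(.|x)
    log_q   : (y, x) -> log q(y|x)
    x0      : starting point X_0 (can be chosen arbitrarily)
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_iter,) + x.shape)
    for t in range(n_iter):
        y = propose(rng, x)
        # log of the M-H ratio u(Y) q(X_t|Y) / (u(X_t) q(Y|X_t))
        log_alpha = log_u(y) + log_q(x, y) - log_u(x) - log_q(y, x)
        if np.log(rng.uniform()) < log_alpha:
            x = y  # accept the proposal
        # on rejection, x is unchanged: the previous position is repeated
        chain[t] = x
    return chain

# Toy usage with an assumed target: u(x) ∝ exp(-x^4/4), Gaussian random walk proposal
log_u = lambda x: -0.25 * float(x) ** 4
propose = lambda rng, x: x + 0.8 * rng.standard_normal()
log_q = lambda y, x: 0.0  # symmetric proposal: the q terms cancel in log_alpha
chain = metropolis_hastings(log_u, propose, log_q, x0=0.0, n_iter=5000)
I_hat = chain.mean()  # \hat{I} for f(x) = x; close to 0 by symmetry of the target
```

Note that only the unnormalized `log_u` enters the sketch, mirroring the cancellation of normalizing constants shown above.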
If the "acceptance" step \(X_{t+1}=Y\) fails, we still need to make \(X_{t+1}=X_t\) (i.e., repeat the previous position once)
This is different from the rejection sampling
Let \(\alpha(x,y)=\min\left(1,\frac{u(y)q(x|y)}{u(x)q(y|x)}\right)\)
Probability of proposing \(x_{t+1}\) and accepting it is \(q(x_{t+1}|x_t)\alpha(x_t,x_{t+1})\)
Probability of accepting a move is \(\alpha(x_t)=\int q(y|x_t)\alpha(x_t,y)\mathrm{d}y\)
The transition kernel is $$\mathcal{T}(x_{t}\rightarrow x_{t+1})=q(x_{t+1}|x_{t})\alpha(x_{t},x_{t+1})+(1-\alpha(x_{t}))\delta(x_{t+1}=x_{t}),$$ where \(\delta(\cdot)\) is a Dirac measure
We need to verify $$\pi(x_t)\mathcal{T}(x_t\rightarrow x_{t+1})=\pi(x_{t+1})\mathcal{T}(x_{t+1}\rightarrow x_t)$$
First term of LHS is $$\begin{align*} & \pi(x_{t})q(x_{t+1}|x_{t})\alpha(x_{t},x_{t+1})\\ =\, & \pi(x_{t})q(x_{t+1}|x_{t})\cdot\min\left(1,\frac{\pi(x_{t+1})q(x_{t}|x_{t+1})}{\pi(x_{t})q(x_{t+1}|x_{t})}\right)\\ =\, & \min\left(\pi(x_{t})q(x_{t+1}|x_{t}),\pi(x_{t+1})q(x_{t}|x_{t+1})\right) \end{align*}$$
Second term of LHS is $$\pi(x_{t})(1-\alpha(x_{t}))\delta(x_{t}=x_{t+1})=\pi(x_{t+1})(1-\alpha(x_{t+1}))\delta(x_{t+1}=x_{t}),$$ where the equality uses that the Dirac mass forces \(x_t=x_{t+1}\); both terms are therefore symmetric in \(x_t\) and \(x_{t+1}\), so detailed balance holds
The Metropolis-Hastings algorithm gives full freedom in choosing the proposal distribution \(q(y|x)\)
One commonly-used implementation of the M-H algorithm is the random walk M-H
It takes \(q(y|x)\) to be a normal distribution \(N(x,\sigma^2 I)\)
Since this proposal is symmetric, \(q(y|x)=q(x|y)\), and the acceptance probability simplifies to $$\alpha=\min\left(1,\frac{u(Y)}{u(X_{t})}\right)$$
\(\sigma^2\) is a hyperparameter
https://chi-feng.github.io/mcmc-demo/app.html?algorithm=RandomWalkMH&target=banana
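A compact sketch of random walk M-H (ours); the two-dimensional banana-shaped target below is an assumed toy density, not necessarily the one in the demo:

```python
import numpy as np

def random_walk_mh(log_u, x0, sigma, n_iter, rng=None):
    """Random walk M-H: q(y|x) = N(x, sigma^2 I), so alpha = min(1, u(Y)/u(X_t))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_iter, x.size))
    n_accept = 0
    for t in range(n_iter):
        y = x + sigma * rng.standard_normal(x.size)  # symmetric Gaussian proposal
        if np.log(rng.uniform()) < log_u(y) - log_u(x):
            x, n_accept = y, n_accept + 1
        chain[t] = x
    return chain, n_accept / n_iter

# Assumed banana-like target: u(x) ∝ exp(-(x1^2 + 10 (x2 - x1^2)^2))
log_u = lambda x: -(x[0] ** 2 + 10.0 * (x[1] - x[0] ** 2) ** 2)
chain, acc_rate = random_walk_mh(log_u, x0=np.zeros(2), sigma=0.5, n_iter=10000)
# The hyperparameter sigma trades off step size against acceptance rate;
# monitoring acc_rate is a common way to tune it.
```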
Do we have a better choice than random walk M-H?
How to design new MCMC algorithms based on M-H?
We will see an excellent example based on the Langevin algorithm
A good review of the history of M-H can be found in [2]
It also introduces many extensions of M-H
Suppose the variable \(X\) we want to sample can be partitioned as \(X=(X_1,\ldots,X_d)\)
Each component \(X_i\) can be a scalar or a vector
The Gibbs sampler has the following iterations (see the sketch after this list):
Select an index \(i\) from \(\{1,\ldots,d\}\) randomly or deterministically
Sample \(X_i^{(k+1)}\sim \pi(x_i|X_{-i}^{(k)})\)
Set \(X^{(k+1)}=(X_1^{(k)},\ldots,X_i^{(k+1)},\ldots,X_d^{(k)})\)
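A minimal sketch of these iterations (ours) for a bivariate normal target with unit variances and correlation \(\rho\), where both full conditionals are normal; a deterministic scan over \(i=1,2\) is used here:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, x0=(0.0, 0.0), rng=None):
    """Gibbs sampler for a bivariate normal with unit variances, correlation rho.

    Full conditionals: X1 | X2 = x2 ~ N(rho*x2, 1 - rho^2), symmetrically for X2.
    """
    rng = np.random.default_rng() if rng is None else rng
    x1, x2 = x0
    s = np.sqrt(1.0 - rho ** 2)  # conditional standard deviation
    chain = np.empty((n_iter, 2))
    for k in range(n_iter):
        # deterministic scan: update each component from its full conditional
        x1 = rho * x2 + s * rng.standard_normal()
        x2 = rho * x1 + s * rng.standard_normal()
        chain[k] = (x1, x2)
    return chain

chain = gibbs_bivariate_normal(rho=0.9, n_iter=5000)
# Sample variances and correlation of `chain` should approach 1 and 0.9.
```

The detailed-balance argument that follows is for the random-scan version; the deterministic scan used here still leaves \(\pi\) invariant, since each conditional update does.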
Suppose index \(i\) is selected with probability \(\rho_i\)
\(X^{(k+1)}\) and \(X^{(k)}\) differ only in component \(X_i\)
\(\mathcal{T}(x^{(k)}\rightarrow x^{(k+1)})=\rho_i\pi\left(x_i^{(k+1)}|x_{-i}^{(k)}\right)\)
We need to verify $$\pi(x^{(k)})\mathcal{T}(x^{(k)}\rightarrow x^{(k+1)})=\pi(x^{(k+1)})\mathcal{T}(x^{(k+1)}\rightarrow x^{(k)})$$
Note that \(x_{-i}^{(k)}=x_{-i}^{(k+1)}\)
LHS is $$\begin{align*} & \pi(x^{(k)})\rho_{i}\pi\left(x_{i}^{(k+1)}|x_{-i}^{(k)}\right)\\ =\, & \pi(x_{-i}^{(k)})\pi\left(x_{i}^{(k)}|x_{-i}^{(k)}\right)\rho_{i}\pi\left(x_{i}^{(k+1)}|x_{-i}^{(k)}\right)\\ =\, & \pi(x_{-i}^{(k+1)})\pi\left(x_{i}^{(k)}|x_{-i}^{(k+1)}\right)\rho_{i}\pi\left(x_{i}^{(k+1)}|x_{-i}^{(k+1)}\right) \end{align*}$$
By symmetry (swap \(k\) and \(k+1\) and use \(x_{-i}^{(k)}=x_{-i}^{(k+1)}\)), this equals the RHS
https://chi-feng.github.io/mcmc-demo/app.html?algorithm=GibbsSampling&target=banana
[1] J. Dongarra & F. Sullivan (2000). Guest editors' introduction: The top 10 algorithms. Computing in Science & Engineering.
[2] D. B. Dunson & J. E. Johndrow (2020). The Hastings algorithm at fifty. Biometrika.