
Computational Statistics

Lecture 12

Yixuan Qiu

2023-12-06


Simulation and Sampling


Today's Topics

  • Markov chain Monte Carlo (MCMC)

  • Metropolis-Hastings algorithm

  • Gibbs sampler



Problem

Given an unnormalized probability density/mass function \(u(x)\propto \pi(x)\), estimate the high-dimensional integral

$$I=\mathbb{E}_X[f(X)]=\int f(x)\pi(x)\mathrm{d}x,\quad X\sim \pi(x).$$

We have already introduced the inverse transform algorithm, rejection sampling, and importance sampling.

But they are not "general" enough.


MCMC

Markov chain Monte Carlo (MCMC) tackles this problem by constructing a Markov chain

$$X_0\rightarrow X_1\rightarrow \cdots \rightarrow X_N,$$

and approximating the integral as

$$\hat{I}=\frac{1}{N}\sum_{i=1}^{N}f(X_{i}).$$

We hope that \(\hat{I}\rightarrow I=\mathbb{E}_X[f(X)]\) almost surely as \(N\rightarrow \infty\).



MCMC

Unlike previously introduced Monte Carlo methods, the Markov chain \(X_0\rightarrow\cdots\rightarrow X_N\) in general contains dependent sample points \(\{X_i\}\).

The Markov property implies that \(X_{t+1}\) depends only on \(X_t\) (plus fresh, independent sources of randomness) and not on \(X_0,\ldots,X_{t-1}\).

Therefore, with a given starting point \(X_0\), the whole Markov chain can be specified via a conditional distribution \(\mathcal{T}(y|x)\):

$$X_1\sim \mathcal{T}(\cdot|X_0),\quad X_2\sim \mathcal{T}(\cdot|X_1),\quad\ldots$$
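
To make this concrete, here is a minimal Python sketch of simulating a chain and forming \(\hat{I}\) once a sampler for \(\mathcal{T}(\cdot|x)\) is available. The names `run_chain` and `kernel_sample` are ours, and for simplicity we assume scalar states:

```python
import numpy as np

def run_chain(kernel_sample, x0, n_steps, rng=None):
    """Simulate X_1, ..., X_N from X_0, where kernel_sample(x, rng)
    draws one sample from the transition kernel T(.|x)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x = kernel_sample(x, rng)  # X_{t+1} ~ T(.|X_t)
        chain[t] = x
    return chain

# The MCMC estimate of I = E[f(X)] is the ergodic average:
# I_hat = np.mean(f(chain))
```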



Transition Kernel

  • Such a conditional distribution is typically called a transition kernel

  • Denoted as \(\mathcal{T}(x\rightarrow y)\equiv \mathcal{T}(y|x)\)

  • Think of \(x\) as the current position/state, and \(y\) the next position/state in the Markov chain

  • Our goal is to find a \(\mathcal{T}(x\rightarrow y)\) such that \(\hat{I}\rightarrow I\ \ a.s.\) holds


Regularity Conditions

We first require \(\mathcal{T}(x\rightarrow y)\) to satisfy some mild regularity conditions:

  • Irreducible: it is possible to get to any state from any state

  • Aperiodic: the chain cannot get trapped in cycles

  • Positive recurrent: the chain revisits every neighborhood infinitely often

These conditions are rather weak and can be easily satisfied.



Invariant/Stationary Distribution

The "real" condition that matters is that \(\pi(x)\) is an invariant/stationary distribution of \(\mathcal{T}(x\rightarrow y)\), meaning that

$$\pi(y)=\int\pi(x)\mathcal{T}(x\rightarrow y)\mathrm{d}x.$$

  • Combined with the regularity conditions, we have \(\hat{I}\rightarrow I\ \ a.s.\)

  • This property is called the ergodicity of Markov chains

  • \(X_0\) can be chosen arbitrarily
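
A toy discrete example (ours, for illustration; in the discrete case the integral becomes a sum): for a two-state chain with \(\mathcal{T}(1\to 2)=a\) and \(\mathcal{T}(2\to 1)=b\), the distribution \(\pi=\left(\frac{b}{a+b},\frac{a}{a+b}\right)\) is stationary, since

$$\sum_{x}\pi(x)\mathcal{T}(x\to 1)=\frac{b}{a+b}(1-a)+\frac{a}{a+b}\,b=\frac{b}{a+b}=\pi(1),$$

and similarly for state 2.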


Detailed Balance

  • How can we satisfy the stationary distribution condition?

  • We typically consider a stronger, but easier-to-verify condition called detailed balance: $$\pi(x)\mathcal{T}(x\rightarrow y)=\pi(y)\mathcal{T}(y\rightarrow x)$$

  • It is easy to verify that detailed balance implies the stationary distribution condition (see below)

  • Sufficient but not necessary
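
The verification is a one-line calculation: integrating both sides of detailed balance over \(x\) gives

$$\int\pi(x)\mathcal{T}(x\rightarrow y)\,\mathrm{d}x=\int\pi(y)\mathcal{T}(y\rightarrow x)\,\mathrm{d}x=\pi(y)\int\mathcal{T}(y\rightarrow x)\,\mathrm{d}x=\pi(y),$$

since \(\mathcal{T}(y\rightarrow\cdot)\) is a probability distribution and integrates to one.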


Important Algorithms

  • Metropolis-Hastings algorithm

  • Gibbs sampler

  • Metropolis-adjusted Langevin algorithm

  • Hamiltonian Monte Carlo


Metropolis-Hastings Algorithm


Metropolis-Hastings Algorithm

  • Listed as one of the "Top 10 Algorithms" of the 20th century [1]

  • Foundation of many other MCMC algorithms

  • Simple and widely used


Metropolis-Hastings Algorithm

  • Choose an initial value \(X_0\)

  • Choose an arbitrary proposal distribution \(q(y|x)\)

  • Given \(X_t\), propose a new point \(Y\sim q(\cdot|X_t)\)

  • Define $$\alpha=\min\left(1,\frac{\pi(Y)q(X_{t}|Y)}{\pi(X_{t})q(Y|X_{t})}\right)$$

  • With probability \(\alpha\) set \(X_{t+1}=Y\); otherwise set \(X_{t+1}=X_t\) (a minimal sketch in code follows)
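
Below is a minimal Python sketch of these steps. The names `metropolis_hastings`, `log_u`, `q_sample`, and `log_q` are ours, and we work with log densities for numerical stability, which leaves \(\alpha\) unchanged:

```python
import numpy as np

def metropolis_hastings(log_u, q_sample, log_q, x0, n_iter, rng=None):
    """Metropolis-Hastings sampler.

    log_u(x)        : log of the unnormalized target density u(x)
    q_sample(x, g)  : draws Y ~ q(.|x) using generator g
    log_q(a, b)     : evaluates log q(a|b)
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    for t in range(n_iter):
        y = q_sample(x, rng)
        # log of pi(Y) q(X_t|Y) / (pi(X_t) q(Y|X_t)), with u in place of pi
        log_ratio = log_u(y) + log_q(x, y) - log_u(x) - log_q(y, x)
        if np.log(rng.uniform()) < log_ratio:  # accept with probability alpha
            x = y
        # on rejection, X_{t+1} = X_t: the previous state is repeated
        chain[t] = x
    return chain
```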


Some (Important) Remarks

  • We do not really need to know \(\pi(x)\)

  • The unnormalized density \(u(x)\propto \pi(x)\) suffices, since $$\alpha=\min\left(1,\frac{\pi(Y)q(X_{t}|Y)}{\pi(X_{t})q(Y|X_{t})}\right)=\min\left(1,\frac{u(Y)q(X_{t}|Y)}{u(X_{t})q(Y|X_{t})}\right)$$

  • \(q(y|x)\) can be chosen almost completely freely (though some choices work better than others)

  • These properties show the real power of M-H


Some (Important) Remarks

  • If the "acceptance" step \(X_{t+1}=Y\) fails, we still need to make \(X_{t+1}=X_t\) (i.e., repeat the previous position once)

  • This is different from the rejection sampling


Transition Kernel

  • Let \(\alpha(x,y)=\min\left(1,\frac{u(y)q(x|y)}{u(x)q(y|x)}\right)\)

  • Probability of proposing \(x_{t+1}\) and accepting it is \(q(x_{t+1}|x_t)\alpha(x_t,x_{t+1})\)

  • Probability of accepting a move is \(\alpha(x_t)=\int q(y|x_t)\alpha(x_t,y)\mathrm{d}y\)

  • The transition kernel is $$\mathcal{T}(x_{t}\rightarrow x_{t+1})=q(x_{t+1}|x_{t})\alpha(x_{t},x_{t+1})+(1-\alpha(x_{t}))\delta(x_{t+1}=x_{t}),$$ where \(\delta(\cdot)\) is a Dirac measure


Verifying Detailed Balance

  • We need to verify $$\pi(x_t)\mathcal{T}(x_t\rightarrow x_{t+1})=\pi(x_{t+1})\mathcal{T}(x_{t+1}\rightarrow x_t)$$

  • First term of LHS is $$\begin{align*} & \pi(x_{t})q(x_{t+1}|x_{t})\alpha(x_{t},x_{t+1})\\ =\, & \pi(x_{t})q(x_{t+1}|x_{t})\cdot\min\left(1,\frac{\pi(x_{t+1})q(x_{t}|x_{t+1})}{\pi(x_{t})q(x_{t+1}|x_{t})}\right)\\ =\, & \min\left(\pi(x_{t})q(x_{t+1}|x_{t}),\pi(x_{t+1})q(x_{t}|x_{t+1})\right) \end{align*}$$

  • Second term of LHS is $$\pi(x_{t})(1-\alpha(x_{t}))\delta(x_{t}=x_{t+1})=\pi(x_{t+1})(1-\alpha(x_{t+1}))\delta(x_{t+1}=x_{t}),$$ which holds because the Dirac measure vanishes unless \(x_t=x_{t+1}\), in which case the two sides coincide


Random Walk M-H

  • The Metropolis-Hastings algorithm gives full freedom in choosing the proposal distribution \(q(y|x)\)

  • One commonly used implementation of the M-H algorithm is the random walk M-H (sketched in code below)

  • It takes \(q(\cdot|x)\) to be the normal distribution \(N(x,\sigma^2 I)\), which is symmetric: \(q(y|x)=q(x|y)\)

  • The proposal densities therefore cancel, and the acceptance probability simplifies to $$\alpha=\min\left(1,\frac{u(Y)}{u(X_{t})}\right)$$

  • \(\sigma^2\) is a hyperparameter controlling the step size
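
A minimal sketch specializing the general M-H code above (names are again ours); since the proposal is symmetric, only \(u\) enters the acceptance step. As a usage example we sample from the unnormalized standard normal \(u(x)\propto e^{-x^2/2}\) and estimate \(I=\mathbb{E}[X^2]\) (true value 1) by the ergodic average:

```python
import numpy as np

def rw_metropolis(log_u, x0, n_iter, sigma=1.0, rng=None):
    """Random walk M-H: Y = X_t + sigma * N(0, I); alpha = min(1, u(Y)/u(X_t))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    for t in range(n_iter):
        y = x + sigma * rng.standard_normal(x.size)   # Y ~ N(X_t, sigma^2 I)
        if np.log(rng.uniform()) < log_u(y) - log_u(x):
            x = y                                     # accept
        chain[t] = x                                  # on rejection, repeat X_t
    return chain

# Usage: u(x) ∝ exp(-x^2 / 2); ergodic average estimates E[X^2] = 1
# (burn-in ignored for simplicity)
chain = rw_metropolis(lambda x: -0.5 * np.sum(x**2), x0=0.0,
                      n_iter=50_000, sigma=2.5)
I_hat = np.mean(chain[:, 0] ** 2)
```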


Extensions

  • Do we have a better choice than random walk M-H?

  • How can we design new MCMC algorithms based on M-H?

  • We will see an excellent example based on the Langevin algorithm

  • A good review of the history of M-H can be found in [2]

  • It also introduces many extensions of M-H


Gibbs Sampler


Gibbs Sampler

  • Suppose the variable \(X\) we want to sample can be partitioned as \(X=(X_1,\ldots,X_d)\)

  • Each component \(X_i\) can be a scalar or a vector

  • The Gibbs sampler has the following iterations (a minimal sketch in code follows the list):

    • Select an index \(i\) from \(\{1,\ldots,d\}\) randomly or deterministically

    • Sample \(X_i^{(k+1)}\sim \pi(x_i|X_{-i}^{(k)})\)

    • Set \(X^{(k+1)}=(X_1^{(k)},\ldots,X_i^{(k+1)},\ldots,X_d^{(k)})\)
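
As an illustration (our own example, not from the slides), consider a bivariate normal with zero means, unit variances, and correlation \(\rho\); its full conditionals are \(X_1|X_2=x_2\sim N(\rho x_2,\,1-\rho^2)\), and symmetrically for \(X_2\). A deterministic-scan Gibbs sampler is then a few lines:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, rng=None):
    """Deterministic-scan Gibbs sampler for a bivariate normal with
    zero means, unit variances, and correlation rho."""
    rng = np.random.default_rng() if rng is None else rng
    x1 = x2 = 0.0                      # arbitrary starting point
    s = np.sqrt(1.0 - rho**2)          # conditional standard deviation
    chain = np.empty((n_iter, 2))
    for k in range(n_iter):
        x1 = rho * x2 + s * rng.standard_normal()  # X1 ~ pi(x1 | x2)
        x2 = rho * x1 + s * rng.standard_normal()  # X2 ~ pi(x2 | x1)
        chain[k] = (x1, x2)
    return chain

chain = gibbs_bivariate_normal(rho=0.8, n_iter=20_000)
```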


Transition Kernel

  • Suppose index \(i\) is selected with probability \(\rho_i\)

  • \(X^{(k+1)}\) and \(X^{(k)}\) differ only in component \(X_i\)

  • \(\mathcal{T}(x^{(k)}\rightarrow x^{(k+1)})=\rho_i\pi\left(x_i^{(k+1)}|x_{-i}^{(k)}\right)\)


Detailed Balance

  • We need to verify $$\pi(x^{(k)})\mathcal{T}(x^{(k)}\rightarrow x^{(k+1)})=\pi(x^{(k+1)})\mathcal{T}(x^{(k+1)}\rightarrow x^{(k)})$$

  • Note that \(x_{-i}^{(k)}=x_{-i}^{(k+1)}\)

  • LHS is $$\begin{align*} & \pi(x^{(k)})\rho_{i}\pi\left(x_{i}^{(k+1)}|x_{-i}^{(k)}\right)\\ =\, & \pi(x_{-i}^{(k)})\pi\left(x_{i}^{(k)}|x_{-i}^{(k)}\right)\rho_{i}\pi\left(x_{i}^{(k+1)}|x_{-i}^{(k)}\right)\\ =\, & \pi(x_{-i}^{(k+1)})\pi\left(x_{i}^{(k)}|x_{-i}^{(k+1)}\right)\rho_{i}\pi\left(x_{i}^{(k+1)}|x_{-i}^{(k+1)}\right) \end{align*}$$

  • By symmetry, this equals the RHS


References

[1] J. Dongarra & F. Sullivan (2000). Guest editors' introduction to the top 10 algorithms. Computing in Science & Engineering.

[2] D. B. Dunson & J. E. Johndrow (2020). The Hastings algorithm at fifty. Biometrika.

