
Computational Statistics

Lecture 12

Yixuan Qiu

2023-12-06


Simulation and Sampling


Today's Topics

  • Markov chain Monte Carlo (MCMC)

  • Metropolis-Hastings algorithm

  • Gibbs sampler



Problem

Given an unnormalized probability density/mass function \(u(x)\propto \pi(x)\), estimate the high-dimensional integral

$$I=\mathbb{E}_X[f(X)]=\int f(x)\pi(x)\mathrm{d}x,\quad X\sim \pi(x).$$

We have already introduced the inverse transform algorithm, rejection sampling, and importance sampling.

But they are not "general" enough.


MCMC

Markov chain Monte Carlo (MCMC) tackles this problem by constructing a Markov chain

$$X_0\rightarrow X_1\rightarrow \cdots \rightarrow X_N,$$

and approximating the integral as

$$\hat{I}=\frac{1}{N}\sum_{i=1}^{N}f(X_{i}).$$

We hope that \(\hat{I}\rightarrow I=\mathbb{E}_X[f(X)]\) almost surely as \(N\rightarrow \infty\).



MCMC

Unlike previously introduced Monte Carlo methods, the Markov chain \(X_0\rightarrow\cdots\rightarrow X_N\) in general contains dependent sample points \(\{X_i\}\).

The Markov property implies that \(X_{t+1}\) depends only on \(X_t\) (plus fresh, independent sources of randomness) and not on \(X_0,\ldots,X_{t-1}\).

Therefore, with a given starting point \(X_0\), the whole Markov chain can be specified via a conditional distribution \(\mathcal{T}(y|x)\):

$$X_1\sim \mathcal{T}(\cdot|X_0),\quad X_2\sim \mathcal{T}(\cdot|X_1),\quad\ldots$$
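
To make this concrete, here is a minimal Python sketch of simulating a chain and forming \(\hat{I}\) once a sampler for \(\mathcal{T}(\cdot|x)\) is available. The names `run_chain` and `kernel_sample` are ours, and for simplicity we assume scalar states:

```python
import numpy as np

def run_chain(kernel_sample, x0, n_steps, rng=None):
    """Simulate X_1, ..., X_N from X_0, where kernel_sample(x, rng)
    draws one sample from the transition kernel T(.|x)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x = kernel_sample(x, rng)  # X_{t+1} ~ T(.|X_t)
        chain[t] = x
    return chain

# The MCMC estimate of I = E[f(X)] is the ergodic average:
# I_hat = np.mean(f(chain))
```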



Transition Kernel

  • Such a conditional distribution is typically called a transition kernel

  • Denoted as \(\mathcal{T}(x\rightarrow y)\equiv \mathcal{T}(y|x)\)

  • Think of \(x\) as the current position/state, and \(y\) the next position/state in the Markov chain

  • Our goal is to find a \(\mathcal{T}(x\rightarrow y)\) such that \(\hat{I}\rightarrow I\ \ a.s.\) holds


Regularity Conditions

We first require \(\mathcal{T}(x\rightarrow y)\) to satisfy some mild regularity conditions:

  • Irreducible: it is possible to get to any state from any state

  • Aperiodic: the chain cannot get trapped in cycles

  • Positive recurrent: the chain revisits every neighborhood infinitely often

These conditions are rather weak and can be easily satisfied.



Invariant/Stationary Distribution

The "real" condition that matters is that \(\pi(x)\) is an invariant/stationary distribution of \(\mathcal{T}(x\rightarrow y)\), meaning that

$$\pi(y)=\int\pi(x)\mathcal{T}(x\rightarrow y)\mathrm{d}x.$$

  • Combined with the regularity conditions, we have \(\hat{I}\rightarrow I\ \ a.s.\)

  • This property is called the ergodicity of Markov chains

  • \(X_0\) can be chosen arbitrarily
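
A toy discrete example (ours, for illustration; in the discrete case the integral becomes a sum): for a two-state chain with \(\mathcal{T}(1\to 2)=a\) and \(\mathcal{T}(2\to 1)=b\), the distribution \(\pi=\left(\frac{b}{a+b},\frac{a}{a+b}\right)\) is stationary, since

$$\sum_{x}\pi(x)\mathcal{T}(x\to 1)=\frac{b}{a+b}(1-a)+\frac{a}{a+b}\,b=\frac{b}{a+b}=\pi(1),$$

and similarly for state 2.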


Detailed Balance

  • How can we satisfy the stationary distribution condition?

  • We typically consider a stronger, but easier-to-verify condition called detailed balance: $$\pi(x)\mathcal{T}(x\rightarrow y)=\pi(y)\mathcal{T}(y\rightarrow x)$$

  • It is easy to verify that detailed balance implies the stationary distribution condition (see below)

  • Sufficient but not necessary
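
The verification is a one-line calculation: integrating both sides of detailed balance over \(x\) gives

$$\int\pi(x)\mathcal{T}(x\rightarrow y)\,\mathrm{d}x=\int\pi(y)\mathcal{T}(y\rightarrow x)\,\mathrm{d}x=\pi(y)\int\mathcal{T}(y\rightarrow x)\,\mathrm{d}x=\pi(y),$$

since \(\mathcal{T}(y\rightarrow\cdot)\) is a probability distribution and integrates to one.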


Important Algorithms

  • Metropolis-Hastings algorithm

  • Gibbs sampler

  • Metropolis-adjusted Langevin algorithm

  • Hamiltonian Monte Carlo


Metropolis-Hastings Algorithm


Metropolis-Hastings Algorithm

  • Listed as one of the "Top 10 Algorithms" of the 20th century [1]

  • Foundation of many other MCMC algorithms

  • Simple and widely used


Metropolis-Hastings Algorithm

  • Choose an initial value \(X_0\)

  • Choose an arbitrary proposal distribution \(q(y|x)\)

  • Given \(X_t\), propose a new point \(Y\sim q(\cdot|X_t)\)

  • Define $$\alpha=\min\left(1,\frac{\pi(Y)q(X_{t}|Y)}{\pi(X_{t})q(Y|X_{t})}\right)$$

  • With probability \(\alpha\) set \(X_{t+1}=Y\); otherwise set \(X_{t+1}=X_t\) (a minimal sketch in code follows)
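
Below is a minimal Python sketch of these steps. The names `metropolis_hastings`, `log_u`, `q_sample`, and `log_q` are ours, and we work with log densities for numerical stability, which leaves \(\alpha\) unchanged:

```python
import numpy as np

def metropolis_hastings(log_u, q_sample, log_q, x0, n_iter, rng=None):
    """Metropolis-Hastings sampler.

    log_u(x)        : log of the unnormalized target density u(x)
    q_sample(x, g)  : draws Y ~ q(.|x) using generator g
    log_q(a, b)     : evaluates log q(a|b)
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    for t in range(n_iter):
        y = q_sample(x, rng)
        # log of pi(Y) q(X_t|Y) / (pi(X_t) q(Y|X_t)), with u in place of pi
        log_ratio = log_u(y) + log_q(x, y) - log_u(x) - log_q(y, x)
        if np.log(rng.uniform()) < log_ratio:  # accept with probability alpha
            x = y
        # on rejection, X_{t+1} = X_t: the previous state is repeated
        chain[t] = x
    return chain
```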


Some (Important) Remarks

  • We do not really need to know \(\pi(x)\)

  • The unnormalized density \(u(x)\propto \pi(x)\) suffices, since $$\alpha=\min\left(1,\frac{\pi(Y)q(X_{t}|Y)}{\pi(X_{t})q(Y|X_{t})}\right)=\min\left(1,\frac{u(Y)q(X_{t}|Y)}{u(X_{t})q(Y|X_{t})}\right)$$

  • \(q(y|x)\) can be chosen almost completely freely (though some choices work better than others)

  • These properties show the real power of M-H


Some (Important) Remarks

  • If the "acceptance" step \(X_{t+1}=Y\) fails, we still need to make \(X_{t+1}=X_t\) (i.e., repeat the previous position once)

  • This is different from the rejection sampling


Transition Kernel

  • Let \(\alpha(x,y)=\min\left(1,\frac{u(y)q(x|y)}{u(x)q(y|x)}\right)\)

  • Probability of proposing \(x_{t+1}\) and accepting it is \(q(x_{t+1}|x_t)\alpha(x_t,x_{t+1})\)

  • Probability of accepting a move is \(\alpha(x_t)=\int q(y|x_t)\alpha(x_t,y)\mathrm{d}y\)

  • The transition kernel is $$\mathcal{T}(x_{t}\rightarrow x_{t+1})=q(x_{t+1}|x_{t})\alpha(x_{t},x_{t+1})+(1-\alpha(x_{t}))\delta(x_{t+1}=x_{t}),$$ where \(\delta(\cdot)\) is a Dirac measure


Verifying Detailed Balance

  • We need to verify $$\pi(x_t)\mathcal{T}(x_t\rightarrow x_{t+1})=\pi(x_{t+1})\mathcal{T}(x_{t+1}\rightarrow x_t)$$

  • First term of LHS is $$\begin{align*} & \pi(x_{t})q(x_{t+1}|x_{t})\alpha(x_{t},x_{t+1})\\ =\, & \pi(x_{t})q(x_{t+1}|x_{t})\cdot\min\left(1,\frac{\pi(x_{t+1})q(x_{t}|x_{t+1})}{\pi(x_{t})q(x_{t+1}|x_{t})}\right)\\ =\, & \min\left(\pi(x_{t})q(x_{t+1}|x_{t}),\pi(x_{t+1})q(x_{t}|x_{t+1})\right) \end{align*}$$

  • Second term of LHS is $$\pi(x_{t})(1-\alpha(x_{t}))\delta(x_{t}=x_{t+1})=\pi(x_{t+1})(1-\alpha(x_{t+1}))\delta(x_{t+1}=x_{t}),$$ which holds because the Dirac measure vanishes unless \(x_t=x_{t+1}\), in which case the two sides coincide


Random Walk M-H

  • The Metropolis-Hastings algorithm gives full freedom in choosing the proposal distribution \(q(y|x)\)

  • One commonly used implementation of the M-H algorithm is the random walk M-H (sketched in code below)

  • It takes \(q(\cdot|x)\) to be the normal distribution \(N(x,\sigma^2 I)\), which is symmetric: \(q(y|x)=q(x|y)\)

  • The proposal densities therefore cancel, and the acceptance probability simplifies to $$\alpha=\min\left(1,\frac{u(Y)}{u(X_{t})}\right)$$

  • \(\sigma^2\) is a hyperparameter controlling the step size
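
A minimal sketch specializing the general M-H code above (names are again ours); since the proposal is symmetric, only \(u\) enters the acceptance step. As a usage example we sample from the unnormalized standard normal \(u(x)\propto e^{-x^2/2}\) and estimate \(I=\mathbb{E}[X^2]\) (true value 1) by the ergodic average:

```python
import numpy as np

def rw_metropolis(log_u, x0, n_iter, sigma=1.0, rng=None):
    """Random walk M-H: Y = X_t + sigma * N(0, I); alpha = min(1, u(Y)/u(X_t))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    for t in range(n_iter):
        y = x + sigma * rng.standard_normal(x.size)   # Y ~ N(X_t, sigma^2 I)
        if np.log(rng.uniform()) < log_u(y) - log_u(x):
            x = y                                     # accept
        chain[t] = x                                  # on rejection, repeat X_t
    return chain

# Usage: u(x) ∝ exp(-x^2 / 2); ergodic average estimates E[X^2] = 1
# (burn-in ignored for simplicity)
chain = rw_metropolis(lambda x: -0.5 * np.sum(x**2), x0=0.0,
                      n_iter=50_000, sigma=2.5)
I_hat = np.mean(chain[:, 0] ** 2)
```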


Extensions

  • Do we have a better choice than random walk M-H?

  • How can we design new MCMC algorithms based on M-H?

  • We will see an excellent example based on the Langevin algorithm

  • A good review of the history of M-H can be found in [2]

  • It also introduces many extensions of M-H


Gibbs Sampler


Gibbs Sampler

  • Suppose the variable \(X\) we want to sample can be partitioned as \(X=(X_1,\ldots,X_d)\)

  • Each component \(X_i\) can be a scalar or a vector

  • The Gibbs sampler has the following iterations (a minimal sketch in code follows the list):

    • Select an index \(i\) from \(\{1,\ldots,d\}\) randomly or deterministically

    • Sample \(X_i^{(k+1)}\sim \pi(x_i|X_{-i}^{(k)})\)

    • Set \(X^{(k+1)}=(X_1^{(k)},\ldots,X_i^{(k+1)},\ldots,X_d^{(k)})\)
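
As an illustration (our own example, not from the slides), consider a bivariate normal with zero means, unit variances, and correlation \(\rho\); its full conditionals are \(X_1|X_2=x_2\sim N(\rho x_2,\,1-\rho^2)\), and symmetrically for \(X_2\). A deterministic-scan Gibbs sampler is then a few lines:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, rng=None):
    """Deterministic-scan Gibbs sampler for a bivariate normal with
    zero means, unit variances, and correlation rho."""
    rng = np.random.default_rng() if rng is None else rng
    x1 = x2 = 0.0                      # arbitrary starting point
    s = np.sqrt(1.0 - rho**2)          # conditional standard deviation
    chain = np.empty((n_iter, 2))
    for k in range(n_iter):
        x1 = rho * x2 + s * rng.standard_normal()  # X1 ~ pi(x1 | x2)
        x2 = rho * x1 + s * rng.standard_normal()  # X2 ~ pi(x2 | x1)
        chain[k] = (x1, x2)
    return chain

chain = gibbs_bivariate_normal(rho=0.8, n_iter=20_000)
```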


Transition Kernel

  • Suppose index \(i\) is selected with probability \(\rho_i\)

  • \(X^{(k+1)}\) and \(X^{(k)}\) differ only in component \(X_i\)

  • \(\mathcal{T}(x^{(k)}\rightarrow x^{(k+1)})=\rho_i\pi\left(x_i^{(k+1)}|x_{-i}^{(k)}\right)\)


Detailed Balance

  • We need to verify $$\pi(x^{(k)})\mathcal{T}(x^{(k)}\rightarrow x^{(k+1)})=\pi(x^{(k+1)})\mathcal{T}(x^{(k+1)}\rightarrow x^{(k)})$$

  • Note that \(x_{-i}^{(k)}=x_{-i}^{(k+1)}\)

  • LHS is $$\begin{align*} & \pi(x^{(k)})\rho_{i}\pi\left(x_{i}^{(k+1)}|x_{-i}^{(k)}\right)\\ =\, & \pi(x_{-i}^{(k)})\pi\left(x_{i}^{(k)}|x_{-i}^{(k)}\right)\rho_{i}\pi\left(x_{i}^{(k+1)}|x_{-i}^{(k)}\right)\\ =\, & \pi(x_{-i}^{(k+1)})\pi\left(x_{i}^{(k)}|x_{-i}^{(k+1)}\right)\rho_{i}\pi\left(x_{i}^{(k+1)}|x_{-i}^{(k+1)}\right) \end{align*}$$

  • By symmetry, this equals the RHS


References

[1] J. Dongarra & F. Sullivan (2000). Guest editors' introduction to the top 10 algorithms. Computing in Science & Engineering.

[2] D. B. Dunson & J. E. Johndrow (2020). The Hastings algorithm at fifty. Biometrika.

