
Computational Statistics

Lecture 12

Yixuan Qiu

2023-12-06


Simulation and Sampling


Today's Topics

  • Markov chain Monte Carlo (MCMC)

  • Metropolis-Hastings algorithm

  • Gibbs sampler


Problem

Given an unnormalized probability density/mass function $u(x) \propto \pi(x)$, estimate the high-dimensional integral

$$I = \mathbb{E}_X[f(X)] = \int f(x)\pi(x)\,dx, \qquad X \sim \pi(x).$$

We have already introduced the inverse transform algorithm, rejection sampling, and importance sampling.

But they are not "general" enough.

MCMC

Markov chain Monte Carlo (MCMC) tackles this problem by constructing a Markov chain

$$X_0 \to X_1 \to \cdots \to X_N,$$

and approximating the integral as

$$\hat{I} = \frac{1}{N}\sum_{i=1}^N f(X_i).$$

We hope that $\hat{I} \to I = \mathbb{E}_X[f(X)]$ almost surely as $N \to \infty$.
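In code, $\hat{I}$ is just a sample average over the chain draws. A minimal sketch in Python (illustrative only; the chain itself comes from an MCMC sampler such as the M-H sketch later in these slides):

    import numpy as np

    def ergodic_average(f, chain):
        # I_hat = (1/N) * sum_i f(X_i), averaging f over the chain draws
        return np.mean([f(x) for x in chain])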


MCMC

Unlike previously introduced Monte Carlo methods, the Markov chain $X_0 \to \cdots \to X_N$ in general contains dependent sample points $\{X_i\}$.

The nature of Markov chains implies that $X_{t+1}$ should be based only on $X_t$ (along with other independent sources of randomness) and not on $X_0, \ldots, X_{t-1}$.

Therefore, with a given starting point $X_0$, the whole Markov chain can be specified via a conditional distribution $T(y|x)$:

$$X_1 \sim T(\cdot \,|\, X_0), \quad X_2 \sim T(\cdot \,|\, X_1), \ldots$$


Transition Kernel

  • Such a conditional distribution is typically called a transition kernel

  • Denoted as $T(x \to y) \equiv T(y|x)$

  • Think of $x$ as the current position/state, and $y$ as the next position/state in the Markov chain

  • Our goal is to find a $T(x \to y)$ such that $\hat{I} \to I$ a.s. holds

Regularity Conditions

We first require $T(x \to y)$ to satisfy some mild regularity conditions:

  • Irreducible: it is possible to get to any state from any state

  • Aperiodic: the chain cannot get trapped in cycles

  • Positive recurrent: the chain revisits every neighborhood infinitely often

These conditions are rather weak and can be easily satisfied.


Invariant/Stationary Distribution

The "real" condition that matters is that $\pi(x)$ is an invariant/stationary distribution of $T(x \to y)$, meaning that

$$\pi(y) = \int \pi(x)\,T(x \to y)\,dx.$$

  • Combined with the regularity conditions, we have $\hat{I} \to I$ a.s.

  • This property is called the ergodicity of Markov chains

  • $X_0$ can be chosen arbitrarily
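For intuition, in a finite state space the integral becomes the matrix identity $\pi P = \pi$. A minimal numerical check in Python, with a two-state transition matrix of our own choosing:

    import numpy as np

    # Two-state transition kernel as a matrix P (rows sum to 1)
    P = np.array([[0.9, 0.1],
                  [0.3, 0.7]])

    # Candidate stationary distribution
    pi = np.array([0.75, 0.25])

    # pi P = pi, so pi is invariant for P
    print(pi @ P)  # [0.75 0.25]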

Detailed Balance

  • How can we satisfy the stationary distribution condition?

  • We typically consider a stronger, but easier-to-verify condition called detailed balance: $\pi(x)\,T(x \to y) = \pi(y)\,T(y \to x)$

  • It is easy to verify that it implies the stationary distribution condition (see below)

  • Sufficient but not necessary
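To verify the implication, integrate both sides of the detailed balance identity over $x$:

$$\int \pi(x)\,T(x \to y)\,dx = \int \pi(y)\,T(y \to x)\,dx = \pi(y)\int T(y \to x)\,dx = \pi(y),$$

since $T(y \to x)$, as a conditional density in $x$, integrates to one.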

Important Algorithms

  • Metropolis-Hastings algorithm

  • Gibbs sampler

  • Metropolis-adjusted Langevin algorithm

  • Hamiltonian Monte Carlo


Metropolis-Hastings Algorithm


Metropolis-Hastings Algorithm

  • Listed as one of the "Top 10 Algorithms" of the 20th century [1]

  • Foundation of many other MCMC algorithms

  • Simple and widely used


Metropolis-Hastings Algorithm

  • Choose an initial value $X_0$

  • Choose an arbitrary proposal distribution $q(y|x)$

  • Given $X_t$, propose a new point $Y \sim q(\cdot \,|\, X_t)$

  • Define $\alpha = \min\left(1, \dfrac{\pi(Y)\,q(X_t|Y)}{\pi(X_t)\,q(Y|X_t)}\right)$

  • With probability $\alpha$ set $X_{t+1} = Y$; otherwise set $X_{t+1} = X_t$ (see the sketch below)
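A minimal sketch of the algorithm in Python for a scalar state (our own illustrative code, not from the slides; u, q_sample, and q_density are user-supplied functions):

    import numpy as np

    def metropolis_hastings(u, q_sample, q_density, x0, n_samples, rng=None):
        # u(x): unnormalized target density, proportional to pi(x)
        # q_sample(x, rng): draws a proposal Y ~ q(.|x)
        # q_density(a, b): evaluates the proposal density q(a|b)
        rng = np.random.default_rng() if rng is None else rng
        x = x0
        chain = np.empty(n_samples)
        for t in range(n_samples):
            y = q_sample(x, rng)  # propose Y ~ q(.|X_t)
            # alpha = min(1, u(Y) q(X_t|Y) / (u(X_t) q(Y|X_t)))
            alpha = min(1.0, u(y) * q_density(x, y) / (u(x) * q_density(y, x)))
            if rng.uniform() < alpha:
                x = y  # accept: X_{t+1} = Y
            # otherwise reject: repeat the previous position, X_{t+1} = X_t
            chain[t] = x
        return chain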

Some (Important) Remarks

  • We do not really need to know $\pi(x)$

  • The unnormalized density $u(x) \propto \pi(x)$ suffices, since $\alpha = \min\left(1, \dfrac{\pi(Y)\,q(X_t|Y)}{\pi(X_t)\,q(Y|X_t)}\right) = \min\left(1, \dfrac{u(Y)\,q(X_t|Y)}{u(X_t)\,q(Y|X_t)}\right)$

  • $q(y|x)$ can be chosen almost completely freely (though some choices are better than others)

  • These properties show the real power of M-H
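In practice, $u$ is usually evaluated on the log scale to avoid numerical overflow/underflow. A sketch of the corresponding acceptance test (log_u and log_q are assumed helper functions, our own naming):

    import numpy as np

    def mh_accept(log_u, log_q, x, y, rng):
        # log of the M-H ratio u(Y) q(X_t|Y) / (u(X_t) q(Y|X_t))
        log_ratio = log_u(y) + log_q(x, y) - log_u(x) - log_q(y, x)
        # accept with probability min(1, exp(log_ratio))
        return np.log(rng.uniform()) < log_ratio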

Some (Important) Remarks

  • If the "acceptance" step $X_{t+1} = Y$ fails, we still need to set $X_{t+1} = X_t$ (i.e., repeat the previous position once)

  • This is different from rejection sampling, where a rejected draw is simply discarded

Transition Kernel

  • Let $\alpha(x, y) = \min\left(1, \dfrac{u(y)\,q(x|y)}{u(x)\,q(y|x)}\right)$

  • The probability of proposing $x_{t+1}$ and accepting it is $q(x_{t+1}|x_t)\,\alpha(x_t, x_{t+1})$

  • The overall probability of accepting a move is $\alpha(x_t) = \int q(y|x_t)\,\alpha(x_t, y)\,dy$

  • The transition kernel is $T(x_t \to x_{t+1}) = q(x_{t+1}|x_t)\,\alpha(x_t, x_{t+1}) + (1 - \alpha(x_t))\,\delta(x_{t+1} = x_t)$, where $\delta(\cdot)$ is a Dirac measure

Verifying Detailed Balance

  • We need to verify $\pi(x_t)\,T(x_t \to x_{t+1}) = \pi(x_{t+1})\,T(x_{t+1} \to x_t)$

  • The first term of the LHS is $\pi(x_t)\,q(x_{t+1}|x_t)\,\alpha(x_t, x_{t+1}) = \pi(x_t)\,q(x_{t+1}|x_t)\min\left(1, \dfrac{\pi(x_{t+1})\,q(x_t|x_{t+1})}{\pi(x_t)\,q(x_{t+1}|x_t)}\right) = \min\left(\pi(x_t)\,q(x_{t+1}|x_t),\ \pi(x_{t+1})\,q(x_t|x_{t+1})\right)$, which is symmetric in $x_t$ and $x_{t+1}$ and hence equals the first term of the RHS

  • The second term of the LHS is $\pi(x_t)\,(1 - \alpha(x_t))\,\delta(x_t = x_{t+1}) = \pi(x_{t+1})\,(1 - \alpha(x_{t+1}))\,\delta(x_{t+1} = x_t)$, since the Dirac measure forces $x_t = x_{t+1}$

Random Walk M-H

  • The Metropolis-Hastings algorithm gives full freedom in choosing the proposal distribution $q(y|x)$

  • One commonly used implementation of the M-H algorithm is the random walk M-H

  • It takes $q(y|x)$ to be a normal distribution $N(x, \sigma^2 I)$

  • Since this proposal is symmetric, $q(y|x) = q(x|y)$, the acceptance probability simplifies to $\alpha = \min\left(1, \dfrac{u(Y)}{u(X_t)}\right)$

  • $\sigma^2$ is a hyperparameter (see the sketch below)
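Continuing the earlier sketch, random walk M-H on a toy target, the standard normal via its unnormalized density (all names come from our illustrative code above, not the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    u = lambda x: np.exp(-0.5 * x**2)  # unnormalized N(0, 1) density
    sigma = 1.0                        # proposal standard deviation (hyperparameter)
    q_sample = lambda x, rng: x + sigma * rng.standard_normal()
    # Normal proposal density up to a constant; symmetric, so it cancels in alpha
    q_density = lambda a, b: np.exp(-0.5 * ((a - b) / sigma) ** 2)

    chain = metropolis_hastings(u, q_sample, q_density,
                                x0=0.0, n_samples=50_000, rng=rng)
    print(chain.mean(), chain.var())   # roughly 0 and 1

Note that chain.mean() is exactly $\hat{I}$ with $f(x) = x$.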

Extensions

  • Do we have a better choice than random walk M-H?

  • How can we design new MCMC algorithms based on M-H?

  • We will see an excellent example based on the Langevin algorithm

  • A good review of the history of M-H can be found in [2]

  • It also introduces many extensions of M-H

Gibbs Sampler


Gibbs Sampler

  • Suppose the variable $X$ we want to sample can be partitioned as $X = (X_1, \ldots, X_d)$

  • Each component $X_i$ can be a scalar or a vector

  • The Gibbs sampler iterates as follows (see the sketch below):

    • Select an index $i$ from $\{1, \ldots, d\}$, randomly or deterministically

    • Sample $X_i^{(k+1)} \sim \pi(x_i \,|\, X_{-i}^{(k)})$, where $X_{-i}$ denotes all components of $X$ except $X_i$

    • Set $X^{(k+1)} = (X_1^{(k)}, \ldots, X_i^{(k+1)}, \ldots, X_d^{(k)})$
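A minimal two-component Gibbs sketch in Python, using a bivariate normal with correlation $\rho$ as a toy target (our own example; its full conditionals $X_1 | X_2 \sim N(\rho X_2, 1 - \rho^2)$ and $X_2 | X_1 \sim N(\rho X_1, 1 - \rho^2)$ are standard results):

    import numpy as np

    rng = np.random.default_rng(0)
    rho = 0.8                          # correlation of the toy target
    n_samples = 50_000

    x1, x2 = 0.0, 0.0                  # arbitrary starting point X^(0)
    chain = np.empty((n_samples, 2))
    for k in range(n_samples):
        # Deterministic scan: update each component from its full conditional
        x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()  # x1 | x2
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()  # x2 | x1
        chain[k] = (x1, x2)

    print(np.corrcoef(chain.T)[0, 1])  # roughly 0.8

Each update conditions on the current values of all other components, and no accept/reject step is needed.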

Transition Kernel

  • Suppose index $i$ is selected with probability $\rho_i$

  • $X^{(k+1)}$ and $X^{(k)}$ differ only in component $X_i$

  • $T(x^{(k)} \to x^{(k+1)}) = \rho_i\,\pi(x_i^{(k+1)} \,|\, x_{-i}^{(k)})$

Detailed Balance

  • We need to verify $\pi(x^{(k)})\,T(x^{(k)} \to x^{(k+1)}) = \pi(x^{(k+1)})\,T(x^{(k+1)} \to x^{(k)})$

  • Note that $x_{-i}^{(k)} = x_{-i}^{(k+1)}$

  • The LHS is $\pi(x^{(k)})\,\rho_i\,\pi(x_i^{(k+1)} \,|\, x_{-i}^{(k)}) = \pi(x_{-i}^{(k)})\,\pi(x_i^{(k)} \,|\, x_{-i}^{(k)})\,\rho_i\,\pi(x_i^{(k+1)} \,|\, x_{-i}^{(k)}) = \pi(x_{-i}^{(k+1)})\,\pi(x_i^{(k)} \,|\, x_{-i}^{(k+1)})\,\rho_i\,\pi(x_i^{(k+1)} \,|\, x_{-i}^{(k+1)})$

  • By symmetry, this equals the RHS

References

[1] J. Dongarra & F. Sullivan (2000). Guest editors' introduction to the top 10 algorithms. Computing in Science & Engineering.

[2] D. B. Dunson & J. E. Johndrow (2020). The Hastings algorithm at fifty. Biometrika.
