Computational Statistics
Lecture 13
Yixuan Qiu
2023-12-13

Simulation and Sampling

Today's Topics

Langevin algorithm
Hamiltonian Monte Carlo

Langevin Algorithm

Langevin Dynamics

The Langevin dynamics, or Langevin diffusion process, is the solution to the following stochastic differential equation (SDE) on $\mathbb{R}^d$:

$$dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t,$$

where $\nabla f$ is the gradient of some smooth function $f$, and $B_t$ is the $d$-dimensional Brownian motion.

Langevin Dynamics

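The SDE can be understood as the continuous-time limit of the following iteration:

$$X_{t+\delta} = X_t - \delta\,\nabla f(X_t) + \sqrt{2}\,(B_{t+\delta} - B_t).$$

It is known that $B_{t+\delta} - B_t \sim N(0, \delta I_d)$, so we can rewrite the formula above as

$$X_{t+\delta} = X_t - \delta\,\nabla f(X_t) + \sqrt{2\delta}\,\xi_t,$$

where $\xi_t \sim N(0, I_d)$ and is independent of $X_t$.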

It is very similar to the gradient descent iteration, except for the additional noise term.

Invariant Distribution

An important property of the Langevin dynamics is that under mild conditions, it has an explicit invariant distribution. Theorem 2.1 of [1]:

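Suppose that $\nabla f$ is continuously differentiable, and there exist some constants $N, a, b < \infty$ such that

$$\nabla f(x)^\top x \ge -a\|x\|^2 - b \quad \text{for all } \|x\| > N.$$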

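Let $p_t$ be the density function of $X_t$. Then $\mathrm{TV}(p_t, p) \to 0$ as $t \to \infty$, where $\mathrm{TV}(p, q)$ is the total variation distance between two densities $p$ and $q$, and $p(x) \propto \exp\{-f(x)\}$.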

Invariant Distribution

This means that no matter what the initial distribution of $X_0$ is, $X_t$ will eventually follow the distribution $p$.

In other words, if we are interested in sampling from some target distribution $p(x) \propto \exp\{-f(x)\}$, we can simulate the Langevin diffusion process of some particles, starting from an arbitrary initial distribution.

This is the foundation of the Langevin Monte Carlo algorithm.

Convergence Rate [2]

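In addition, if $f$ is $m$-strongly-convex and $L$-smooth, then

$$\mathrm{TV}(p_t, p) \le \frac{1}{2}\,\chi^2(p_0 \,\|\, p)^{1/2}\, e^{-tm/2},$$

where $p_t$ is the density of $X_t$, $p(x) \propto \exp\{-f(x)\}$, and $\chi^2(p \,\|\, q)$ is the $\chi^2$ divergence between two densities $p$ and $q$.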

This means that if $f$ is both smooth and strongly convex, then the distribution of $X_t$ converges to $p$ exponentially fast.

Unadjusted Langevin Algorithm

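A natural way to simulate the SDE is to use the discretized iteration with step size $\delta > 0$:

$$X_{k+1} = X_k - \delta\,\nabla f(X_k) + \sqrt{2\delta}\,\xi_k, \qquad \xi_k \sim N(0, I_d).$$

This is called the unadjusted Langevin algorithm (ULA).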
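A minimal NumPy sketch of this iteration is given below. It assumes the update $X_{k+1} = X_k - \delta\,\nabla f(X_k) + \sqrt{2\delta}\,\xi_k$; the function name `ula`, its arguments, and the standard normal target used for illustration are assumptions rather than part of the lecture.

```python
import numpy as np

def ula(grad_f, x0, step, n_iter, rng):
    """Unadjusted Langevin algorithm (illustrative sketch).

    Each iteration applies x <- x - step * grad_f(x) + sqrt(2 * step) * noise.
    """
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iter,) + x.shape)
    for k in range(n_iter):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_f(x) + np.sqrt(2.0 * step) * noise
        samples[k] = x
    return samples

# Assumed example target: standard normal in 2 dimensions,
# i.e. f(x) = ||x||^2 / 2 and grad f(x) = x.
rng = np.random.default_rng(0)
ula_samples = ula(lambda x: x, x0=np.full(2, 5.0), step=0.05, n_iter=20000, rng=rng)

# Discard an initial burn-in segment before summarizing.
ula_mean = ula_samples[2000:].mean(axis=0)
ula_var = ula_samples[2000:].var(axis=0)
```

The sample mean and variance should be close to, but not exactly equal to, those of the target, since no correction for the discretization is applied.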

Convergence of ULA [3]

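Let $p_k$ denote the distribution of $X_k$ generated by ULA with step size $\delta$. Suppose that $f$ is $m$-strongly-convex and $L$-smooth. Then, for the initial distribution, step size, and number of iterations specified in [3], $p_k$ is guaranteed to be close to the target $p$ in both $\mathrm{KL}(p_k \,\|\, p)$ and $W_2(p_k, p)$,

where $\mathrm{KL}$ is the KL divergence, and $W_2$ is the 2-Wasserstein distance.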

Metropolis-Adjusted Langevin Algorithm

Due to the discretization, in general the distribution of $X_k$ generated by ULA would NOT converge to $p$.
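A small worked example makes this bias concrete. It is an illustration under assumed notation, not taken from the lecture: take the one-dimensional target $N(0, 1)$, i.e. $f(x) = x^2/2$, and the ULA update $X_{k+1} = X_k - \delta f'(X_k) + \sqrt{2\delta}\,\xi_k$ with step size $0 < \delta < 2$.

```latex
\begin{align*}
X_{k+1} &= (1 - \delta) X_k + \sqrt{2\delta}\,\xi_k, \qquad \xi_k \sim N(0, 1), \\
\sigma^2 &= (1 - \delta)^2 \sigma^2 + 2\delta
  \quad \text{(stationary variance of the chain)}, \\
\sigma^2 &= \frac{2\delta}{1 - (1 - \delta)^2}
          = \frac{2\delta}{2\delta - \delta^2}
          = \frac{1}{1 - \delta/2} > 1 .
\end{align*}
```

In this example the chain converges to $N(0, 1/(1 - \delta/2))$ rather than $N(0, 1)$, and the gap vanishes only as $\delta \to 0$.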

However, ULA provides a good proposal distribution for the Metropolis-Hastings algorithm.

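Random walk Metropolis:

1. Propose $X' = X_k + \sqrt{2\delta}\,\xi_k$, $\xi_k \sim N(0, I_d)$
2. Define $\alpha = \min\{1,\ p(X') / p(X_k)\}$
3. Set $X_{k+1} = X'$ with probability $\alpha$, otherwise $X_{k+1} = X_k$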
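The random walk Metropolis steps can be sketched as follows. This is a hedged illustration: it assumes a Gaussian proposal $N(X_k, 2\delta I_d)$ and a target $p(x) \propto \exp\{-f(x)\}$, so that $\log p(X') - \log p(X_k) = f(X_k) - f(X')$. The name `rwm` and the example target are hypothetical.

```python
import numpy as np

def rwm(f, x0, step, n_iter, rng):
    """Random walk Metropolis targeting p(x) proportional to exp(-f(x))."""
    x = np.array(x0, dtype=float)
    fx = f(x)
    samples = np.empty((n_iter,) + x.shape)
    n_accept = 0
    for k in range(n_iter):
        # Symmetric Gaussian proposal with covariance 2 * step * I (assumed form).
        prop = x + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
        f_prop = f(prop)
        # Accept with probability min(1, p(prop) / p(x)), compared on the log scale.
        if np.log(rng.uniform()) < fx - f_prop:
            x, fx = prop, f_prop
            n_accept += 1
        samples[k] = x
    return samples, n_accept / n_iter

# Assumed example target: standard normal in 2 dimensions.
rng = np.random.default_rng(1)
rwm_samples, rwm_rate = rwm(lambda x: 0.5 * np.sum(x ** 2),
                            x0=np.zeros(2), step=0.5, n_iter=20000, rng=rng)
rwm_mean = rwm_samples[1000:].mean(axis=0)
rwm_var = rwm_samples[1000:].var(axis=0)
```

Because the proposal is symmetric, only the ratio of target densities enters the acceptance probability.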

Metropolis-Adjusted Langevin Algorithm

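Metropolis-Adjusted Langevin Algorithm (MALA):

1. Propose $X' = X_k - \delta\,\nabla f(X_k) + \sqrt{2\delta}\,\xi_k$, $\xi_k \sim N(0, I_d)$
2. Define $\alpha = \min\left\{1,\ \dfrac{p(X')\,q(X_k \mid X')}{p(X_k)\,q(X' \mid X_k)}\right\}$, where $q(y \mid x)$ is the density of $N(x - \delta\,\nabla f(x),\ 2\delta I_d)$ evaluated at $y$
3. Set $X_{k+1} = X'$ with probability $\alpha$, otherwise $X_{k+1} = X_k$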
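A minimal sketch of MALA follows, assuming the ULA-type proposal $N(x - \delta\,\nabla f(x),\ 2\delta I_d)$ and the standard Metropolis-Hastings ratio for a target $p(x) \propto \exp\{-f(x)\}$. The names `mala` and `log_q` and the example target are hypothetical.

```python
import numpy as np

def mala(f, grad_f, x0, step, n_iter, rng):
    """Metropolis-adjusted Langevin algorithm targeting p(x) proportional to exp(-f(x))."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iter,) + x.shape)
    n_accept = 0

    def log_q(y, x_from):
        # Log density, up to an additive constant, of proposing y from x_from:
        # y ~ N(x_from - step * grad_f(x_from), 2 * step * I).
        mean = x_from - step * grad_f(x_from)
        return -np.sum((y - mean) ** 2) / (4.0 * step)

    for k in range(n_iter):
        noise = rng.standard_normal(x.shape)
        prop = x - step * grad_f(x) + np.sqrt(2.0 * step) * noise
        # log of p(prop) q(x | prop) / (p(x) q(prop | x))
        log_alpha = (f(x) - f(prop)) + log_q(x, prop) - log_q(prop, x)
        if np.log(rng.uniform()) < log_alpha:
            x = prop
            n_accept += 1
        samples[k] = x
    return samples, n_accept / n_iter

# Assumed example target: standard normal in 2 dimensions.
rng = np.random.default_rng(2)
mala_samples, mala_rate = mala(lambda x: 0.5 * np.sum(x ** 2), lambda x: x,
                               x0=np.zeros(2), step=0.5, n_iter=20000, rng=rng)
mala_mean = mala_samples[1000:].mean(axis=0)
mala_var = mala_samples[1000:].var(axis=0)
```

Unlike the random walk proposal, the Langevin proposal is not symmetric, which is why the two `log_q` terms appear in the acceptance ratio.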

Convergence Rate of MALA [4]

For smooth and strongly convex $f$, [4] shows that MALA has advantages over ULA and random walk Metropolis.

Roughly speaking, to obtain samples with total variation error at most $\varepsilon$, ULA requires $O(\kappa^2 d / \varepsilon^2)$ steps from a warm start, whereas MALA only requires $O(\kappa d \log(1/\varepsilon))$. Here $\kappa = L/m$ is the condition number.

As a comparison, the random walk Metropolis requires $O(\kappa^2 d \log(1/\varepsilon))$ steps.

Underdamped Langevin Diffusion

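Another variant of Langevin diffusion, called underdamped Langevin diffusion, solves the following SDE:

$$dX_t = V_t\,dt, \qquad dV_t = -\gamma V_t\,dt - \nabla f(X_t)\,dt + \sqrt{2\gamma}\,dB_t,$$

where $\gamma > 0$ is a friction parameter. Here we introduce an auxiliary process $V_t$. Interestingly, $(X_t, V_t)$ has a joint invariant distribution of the form

$$p(x, v) \propto \exp\left\{-f(x) - \tfrac{1}{2}\|v\|^2\right\},$$

where $p(x) \propto \exp\{-f(x)\}$ is the target distribution.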

Underdamped Langevin Diffusion

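This means that if the distribution of $(X_t, V_t)$ converges to the invariant distribution, then $X_t$ eventually follows $p$, and $V_t$ converges to a normal distribution independent of $X_t$.

The discretized version, with friction parameter $\gamma$ and step size $\delta$, is

$$X_{k+1} = X_k + \delta V_k, \qquad V_{k+1} = V_k - \delta\,[\gamma V_k + \nabla f(X_k)] + \sqrt{2\gamma\delta}\,\xi_k, \qquad \xi_k \sim N(0, I_d).$$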

However, for the underdamped Langevin diffusion we do not have an obvious Metropolis correction method.
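A simple uncorrected discretization can be sketched as follows. The sketch assumes one common parameterization, $dX_t = V_t\,dt$ and $dV_t = -[\gamma V_t + \nabla f(X_t)]\,dt + \sqrt{2\gamma}\,dB_t$, together with an Euler-Maruyama step; the friction parameter `gamma`, the function name, and the example target are assumptions and may differ from the form used in the lecture.

```python
import numpy as np

def underdamped_langevin(grad_f, x0, v0, step, gamma, n_iter, rng):
    """Euler-Maruyama discretization of an underdamped Langevin diffusion (sketch)."""
    x = np.array(x0, dtype=float)
    v = np.array(v0, dtype=float)
    xs = np.empty((n_iter,) + x.shape)
    vs = np.empty((n_iter,) + v.shape)
    for k in range(n_iter):
        noise = rng.standard_normal(v.shape)
        # Both updates use the state (x, v) from the previous iteration.
        x_new = x + step * v
        v_new = v - step * (gamma * v + grad_f(x)) + np.sqrt(2.0 * gamma * step) * noise
        x, v = x_new, v_new
        xs[k], vs[k] = x, v
    return xs, vs

# Assumed example target: standard normal in 2 dimensions, grad f(x) = x.
rng = np.random.default_rng(3)
ul_xs, ul_vs = underdamped_langevin(lambda x: x, x0=np.zeros(2), v0=np.zeros(2),
                                    step=0.02, gamma=2.0, n_iter=50000, rng=rng)
ul_x_var = ul_xs[5000:].var(axis=0)
ul_v_var = ul_vs[5000:].var(axis=0)
```

For this example both the position and the velocity should have variance close to 1, up to a discretization bias of order `step`.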

Hamiltonian Monte Carlo

Hamiltonian Dynamics

Imagine a particle with position $x \in \mathbb{R}^d$, velocity $v \in \mathbb{R}^d$, and unit mass
Define the total energy as $H(x, v) = f(x) + \frac{1}{2}\|v\|^2$
Also called the Hamiltonian of the particle

Hamiltonian Dynamics

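The Hamiltonian dynamics are a set of differential equations:

$$\frac{dx}{dt} = \frac{\partial H}{\partial v}, \qquad \frac{dv}{dt} = -\frac{\partial H}{\partial x}$$

We immediately find that $\dfrac{dH}{dt} = 0$, i.e., the Hamiltonian is preserved along the trajectory
With the given form $H(x, v) = f(x) + \frac{1}{2}\|v\|^2$, we have

$$\frac{dx}{dt} = v, \qquad \frac{dv}{dt} = -\nabla f(x)$$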
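For reference, conservation of the Hamiltonian can be checked with the chain rule. The short derivation below assumes the Hamiltonian $H(x, v) = f(x) + \frac{1}{2}\|v\|^2$ and the resulting dynamics $dx/dt = v$, $dv/dt = -\nabla f(x)$.

```latex
\begin{align*}
\frac{dH}{dt}
  &= \left(\frac{\partial H}{\partial x}\right)^{\!\top} \frac{dx}{dt}
   + \left(\frac{\partial H}{\partial v}\right)^{\!\top} \frac{dv}{dt} \\
  &= \nabla f(x)^\top v + v^\top \bigl(-\nabla f(x)\bigr) \\
  &= 0 .
\end{align*}
```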

Hamiltonian Dynamics

Starting from an initial value $(x_0, v_0)$, denote the solution to the equations by $(x_t, v_t) = \varphi_t(x_0, v_0)$
Properties
- $\varphi_t$ is a deterministic operator, i.e., no random variables are involved so far
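For one special case the solution operator is available in closed form, which gives a quick numerical check of its deterministic and energy-preserving behavior. The sketch below assumes $f(x) = \|x\|^2/2$, for which the dynamics $dx/dt = v$, $dv/dt = -x$ is a rotation in the $(x, v)$ plane; the names `hamiltonian` and `exact_flow` are hypothetical.

```python
import numpy as np

def hamiltonian(x, v):
    """H(x, v) = f(x) + ||v||^2 / 2 for the assumed example f(x) = ||x||^2 / 2."""
    return 0.5 * np.sum(x ** 2) + 0.5 * np.sum(v ** 2)

def exact_flow(x0, v0, t):
    """Exact solution of dx/dt = v, dv/dt = -x at time t, started from (x0, v0)."""
    c, s = np.cos(t), np.sin(t)
    return c * x0 + s * v0, -s * x0 + c * v0

x0 = np.array([1.0, -0.5])
v0 = np.array([0.3, 0.8])

# The energy is the same before and after moving along the flow.
xt, vt = exact_flow(x0, v0, 1.7)
energy_gap = abs(hamiltonian(xt, vt) - hamiltonian(x0, v0))

# Flowing for time a and then for time b equals flowing for time a + b.
xa, va = exact_flow(x0, v0, 0.4)
xab, vab = exact_flow(xa, va, 1.3)
```

For general $f$ no such closed form exists, so this example only illustrates the properties of the exact solution.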

Stationarity Theorem [5]

Suppose $(X, V) \sim \pi(x, v) \propto \exp\{-H(x, v)\}$, then for any $t \ge 0$, $\varphi_t(X, V) \sim \pi$

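$(X, V) \sim \pi(x, v) \propto \exp\{-H(x, v)\}$ implies that $X$ and $V$ are independent with $X \sim p(x) \propto \exp\{-f(x)\}$ and $V \sim N(0, I_d)$
The theorem indicates that the solution operator $\varphi_t$ of the Hamiltonian dynamics does not change the distribution of the random vectors $(X, V)$
Marginally, $\varphi_t$ does not change the distribution of $X$ for any $V \sim N(0, I_d)$ that is independent of $X$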

Idealized HMC

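If we define a Markov chain with the following transition algorithm from $X_k$ to $X_{k+1}$, where $\varphi_T$ denotes the solution operator of the Hamiltonian dynamics over a fixed time $T > 0$:

1. Simulate $V_k \sim N(0, I_d)$ independent of $X_k$
2. Let $(X_{k+1}, V') = \varphi_T(X_k, V_k)$

Then this transition kernel leaves $p$ invariant, i.e., $X_k \sim p$ implies $X_{k+1} \sim p$, where $p(x) \propto \exp\{-f(x)\}$.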

Idealized HMC

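For $k = 0, 1, 2, \ldots$:

1. Sample $V_k \sim N(0, I_d)$
2. Set $(X_{k+1}, V') = \varphi_T(X_k, V_k)$, where $\varphi_T$ is the solution operator of the Hamiltonian dynamics over time $T$

Then under mild conditions on $f$ we obtain the ergodicity of $\{X_k\}$.

Note that $\varphi_T$ is essentially determined by $f$, and we assume that the differential equations can be solved exactly.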
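The idealized algorithm can be run exactly in the special case where the Hamiltonian flow has a closed form. The sketch below assumes a standard normal target, $f(x) = \|x\|^2/2$, for which the position after time $T$ is $x \cos T + v \sin T$; the function name and the choice $T = 1$ are illustrative assumptions.

```python
import numpy as np

def ideal_hmc_gaussian(x0, T, n_iter, rng):
    """Idealized HMC for the assumed target N(0, I), using the exact flow."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iter,) + x.shape)
    for k in range(n_iter):
        # Step 1: refresh the velocity, independently of the current position.
        v = rng.standard_normal(x.shape)
        # Step 2: keep the position component of the exact flow at time T.
        x = np.cos(T) * x + np.sin(T) * v
        samples[k] = x
    return samples

rng = np.random.default_rng(4)
ihmc_samples = ideal_hmc_gaussian(x0=np.full(2, 5.0), T=1.0, n_iter=10000, rng=rng)
ihmc_mean = ihmc_samples[100:].mean(axis=0)
ihmc_var = ihmc_samples[100:].var(axis=0)
```

No accept/reject step is needed here because the dynamics is solved exactly, so each transition preserves the target distribution.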

Convergence Rate [6]

Assume that $f$ is $m$-strongly-convex and $L$-smooth. Let $p_k$ be the marginal distribution of $X_k$ from the idealized HMC algorithm. Then, for the integration time and constants specified in [6], $W_2(p_k, p)$ decays geometrically in $k$, where $W_2$ is the 2-Wasserstein distance between two distributions.

Some Remarks

The previous theorem means that if $f$ is both smooth and strongly convex, then the distribution of $X_k$ converges to $p$ exponentially fast.

At first glance it is based on the same assumption as in the Langevin algorithm, and the result is also similar.

However, empirical results show that HMC tends to perform better for more complicated distributions, e.g. multimodal distributions.

Discretized HMC

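In reality, of course, we cannot solve the differential equations exactly
Need to discretize the solution, e.g. with the leapfrog scheme
Let $x^{(0)} = X_k$, $v^{(0)} = V_k \sim N(0, I_d)$. For $l = 0, 1, \ldots, S - 1$,

$$v^{(l+1/2)} = v^{(l)} - \frac{\eta}{2}\,\nabla f(x^{(l)}), \qquad x^{(l+1)} = x^{(l)} + \eta\, v^{(l+1/2)}, \qquad v^{(l+1)} = v^{(l+1/2)} - \frac{\eta}{2}\,\nabla f(x^{(l+1)})$$

Propose $x^{(S)}$ as the next state
The step size $\eta$ and the number of steps $S$ are tuning parameters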
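One widely used discretization of the Hamiltonian dynamics is the leapfrog scheme, sketched below. Whether the lecture uses exactly this scheme is an assumption; the names `leapfrog`, `step`, and `n_steps` are hypothetical.

```python
import numpy as np

def leapfrog(grad_f, x, v, step, n_steps):
    """Leapfrog discretization of dx/dt = v, dv/dt = -grad_f(x)."""
    x = np.array(x, dtype=float)
    v = np.array(v, dtype=float)
    for _ in range(n_steps):
        v = v - 0.5 * step * grad_f(x)   # half step for the velocity
        x = x + step * v                 # full step for the position
        v = v - 0.5 * step * grad_f(x)   # second half step for the velocity
    return x, v

# Assumed example: f(x) = ||x||^2 / 2, so grad f(x) = x.
def energy(x, v):
    return 0.5 * np.sum(x ** 2) + 0.5 * np.sum(v ** 2)

x_start = np.array([1.0, -0.5])
v_start = np.array([0.3, 0.8])
x_end, v_end = leapfrog(lambda x: x, x_start, v_start, step=0.1, n_steps=10)

# The Hamiltonian is only approximately preserved by the discretization.
lf_energy_gap = abs(energy(x_end, v_end) - energy(x_start, v_start))

# Leapfrog is time-reversible: flipping the velocity and integrating again
# returns to the starting position with the opposite velocity.
x_back, v_back = leapfrog(lambda x: x, x_end, -v_end, step=0.1, n_steps=10)
```

Time reversibility and volume preservation are the properties that make leapfrog a convenient proposal mechanism inside a Metropolis-Hastings step.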

Discretized HMC

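But now we lose the preservation of the Hamiltonian
Need a Metropolis-Hastings correction!
Acceptance probability, where $(x^{(0)}, v^{(0)})$ and $(x^{(S)}, v^{(S)})$ denote the states before and after the discretized trajectory:

$$\alpha = \min\left\{1,\ \exp\left[H(x^{(0)}, v^{(0)}) - H(x^{(S)}, v^{(S)})\right]\right\}$$

With probability $\alpha$ set $X_{k+1} = x^{(S)}$; otherwise set $X_{k+1} = X_k$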
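Putting the pieces together, a discretized HMC sampler with a Metropolis-Hastings correction can be sketched as below. It assumes a leapfrog integrator, the Hamiltonian $H(x, v) = f(x) + \frac{1}{2}\|v\|^2$, and a standard normal example target; all function and parameter names are hypothetical.

```python
import numpy as np

def leapfrog(grad_f, x, v, step, n_steps):
    """Leapfrog discretization of dx/dt = v, dv/dt = -grad_f(x)."""
    x = np.array(x, dtype=float)
    v = np.array(v, dtype=float)
    for _ in range(n_steps):
        v = v - 0.5 * step * grad_f(x)
        x = x + step * v
        v = v - 0.5 * step * grad_f(x)
    return x, v

def hmc(f, grad_f, x0, step, n_steps, n_iter, rng):
    """HMC with leapfrog proposals and a Metropolis-Hastings accept/reject step."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iter,) + x.shape)
    n_accept = 0
    for k in range(n_iter):
        v = rng.standard_normal(x.shape)              # fresh velocity
        h_old = f(x) + 0.5 * np.sum(v ** 2)           # Hamiltonian before
        x_new, v_new = leapfrog(grad_f, x, v, step, n_steps)
        h_new = f(x_new) + 0.5 * np.sum(v_new ** 2)   # Hamiltonian after
        # Accept with probability min(1, exp(h_old - h_new)).
        if np.log(rng.uniform()) < h_old - h_new:
            x = x_new
            n_accept += 1
        samples[k] = x
    return samples, n_accept / n_iter

# Assumed example target: standard normal in 2 dimensions.
rng = np.random.default_rng(5)
hmc_samples, hmc_rate = hmc(lambda x: 0.5 * np.sum(x ** 2), lambda x: x,
                            x0=np.zeros(2), step=0.2, n_steps=5,
                            n_iter=5000, rng=rng)
hmc_mean = hmc_samples[200:].mean(axis=0)
hmc_var = hmc_samples[200:].var(axis=0)
```

With a small step size the change in the Hamiltonian is small, so most proposals are accepted; larger step sizes trade cheaper trajectories for more rejections.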

References

[1] Gareth O. Roberts and Richard L. Tweedie (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli.

[2] Arnak S. Dalalyan (2017). Theoretical guarantees for approximate sampling from smooth and log‐concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

[3] Xiang Cheng and Peter Bartlett (2018). Convergence of Langevin MCMC in KL-divergence. Algorithmic Learning Theory.

[4] Raaz Dwivedi et al. (2019). Log-concave sampling: Metropolis-Hastings algorithms are fast. Journal of Machine Learning Research.

[5] Nisheeth K. Vishnoi (2021). An introduction to Hamiltonian Monte Carlo method for sampling. arXiv:2108.12107.

[6] Zongchen Chen and Santosh S. Vempala (2022). Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. Theory of Computing.
