
Computational Statistics

Lecture 13

Yixuan Qiu

2023-12-13

1 / 29

Simulation and Sampling

2 / 29

Today's Topics

  • Langevin algorithm

  • Hamiltonian Monte Carlo

3 / 29

Langevin Algorithm

4 / 29

Langevin Dynamics

The Langevin dynamics, or Langevin diffusion process, is the solution to the following stochastic differential equation (SDE) on $\mathbb{R}^d$:

$$dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dB_t, \quad X_0 \sim p_0(x),$$

where $\nabla V(\cdot)$ is the gradient of some smooth function $V: \mathbb{R}^d \to \mathbb{R}$, and $B_t$ is the $d$-dimensional Brownian motion.

5 / 29

Langevin Dynamics

The SDE can be understood as the continuous-time limit of the following iteration:

$$\tau^{-1}(X_{t+\tau} - X_t) = -\nabla V(X_t) + \sqrt{2}\,\tau^{-1}(B_{t+\tau} - B_t).$$

It is known that $B_{t+\tau} - B_t \sim N(0, \tau I)$, so we can rewrite the formula above as

$$X_{t+\tau} - X_t = -\tau \nabla V(X_t) + \sqrt{2\tau}\,\xi_{t+\tau},$$

where $\xi_{t+\tau} \sim N(0, I)$ is independent of $X_t$.

It is very similar to the gradient descent iteration, except for the additional noise term.

6 / 29

Invariant Distribution

An important property of the Langevin dynamics is that under mild conditions, it has an explicit invariant distribution. Theorem 2.1 of [1]:

Suppose that $V(\cdot)$ is continuously differentiable, and there exist some constants $a, b, C < \infty$ such that

$$\langle \nabla V(x), x \rangle \ge a \|x\|^2 + b, \quad \forall\, \|x\| > C.$$

Let $p_t(x)$ be the density function of $X_t$. Then $\|p_t - \pi\|_{TV} \to 0$, where $\|f - g\|_{TV}$ is the total variation distance between two densities $f$ and $g$, and $\pi(x) \propto \exp\{-V(x)\}$.

7 / 29

Invariant Distribution

This means that no matter what the initial distribution $p_0(x)$ is, $X_t$ will eventually follow the distribution $\pi(x) \propto \exp\{-V(x)\}$.

In other words, if we are interested in sampling from some target distribution $\pi(x) \propto \exp\{-V(x)\}$, we can simulate the Langevin diffusion process of some particles, starting from an arbitrary initial distribution.

This is the foundation of the Langevin Monte Carlo algorithm.

8 / 29

Convergence Rate [2]

In addition, if $V(\cdot)$ is $m$-strongly-convex and $L$-smooth, then

$$\|p_t - \pi\|_{TV} \le \tfrac{1}{2} \chi^2(p_0 \,\|\, \pi)^{1/2} e^{-mt/2},$$

where $\chi^2(f \,\|\, g) = \int (f(x)/g(x) - 1)^2 g(x)\,dx$ is the $\chi^2$ divergence between two densities $f$ and $g$.

This means that if $V(\cdot)$ is both smooth and strongly convex, then the distribution of $X_t$ converges to $\pi$ exponentially fast.

9 / 29

Unadjusted Langevin Algorithm

A natural and obvious method to simulate the SDE is to use the discretized iteration:

$$X^{(0)} \sim p_0(x), \quad \xi^{(k)} \overset{iid}{\sim} N(0, I_d),$$
$$X^{(k+1)} = X^{(k)} - \tau \nabla V(X^{(k)}) + \sqrt{2\tau}\,\xi^{(k+1)}, \quad k = 0, 1, \ldots$$

This is called the unadjusted Langevin algorithm (ULA).
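
To make the iteration concrete, here is a minimal Python sketch of ULA for a toy target $\pi(x) \propto \exp\{-V(x)\}$ with $V(x) = \|x\|^2/2$ (a standard Gaussian), so that $\nabla V(x) = x$; the names grad_V and ula are our own, not from any library.

import numpy as np

def grad_V(x):
    # Toy potential V(x) = ||x||^2 / 2 (standard Gaussian target), so grad V(x) = x
    return x

def ula(x0, tau, n_steps, rng):
    # ULA iteration: X_{k+1} = X_k - tau * grad_V(X_k) + sqrt(2 * tau) * xi_{k+1}
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        xi = rng.standard_normal(x.size)
        x = x - tau * grad_V(x) + np.sqrt(2 * tau) * xi
        samples[k] = x
    return samples

rng = np.random.default_rng(0)
draws = ula(x0=np.zeros(2), tau=0.01, n_steps=50000, rng=rng)
print(draws.mean(axis=0), draws.var(axis=0))  # close to (0, 0) and (1, 1)

Because of the discretization bias, the empirical variance drifts away from 1 as $\tau$ grows, which is exactly the issue the Metropolis adjustment below corrects.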

10 / 29

Convergence of ULA [3]

Let $p_{k\tau}$ denote the distribution of $X^{(k)}$ with step size $\tau$. Suppose that $V(\cdot)$ is $m$-strongly-convex and $L$-smooth, and set $p_0 = N(0, m^{-1} I)$. Then with

$$\tau = \frac{m\varepsilon}{16 d L^2}, \quad k = \frac{16 L^2 d \log\left(Ld/(m\varepsilon)\right)}{m^2 \varepsilon},$$

we have

$$KL(p_{k\tau} \,\|\, \pi) \le \varepsilon, \quad \|p_{k\tau} - \pi\|_{TV} \le \sqrt{\varepsilon}, \quad W_2(p_{k\tau}, \pi) \le \sqrt{2\varepsilon/m},$$

where $KL(\cdot\,\|\,\cdot)$ is the KL divergence, and $W_2(\cdot, \cdot)$ is the 2-Wasserstein distance.
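
To get a feel for these rates, one can plug concrete numbers into the expressions above (the constants $m$, $L$, $d$, $\varepsilon$ below are arbitrary illustrative choices, and the prefactors follow our reading of the theorem):

import numpy as np

# Illustrative constants: strong convexity m, smoothness L, dimension d, accuracy eps
m, L, d, eps = 1.0, 2.0, 10, 0.01
tau = m * eps / (16 * d * L**2)                               # step size from the theorem
k = 16 * L**2 * d * np.log(L * d / (m * eps)) / (m**2 * eps)  # number of iterations
print(f"tau = {tau:.2e}, k = {k:.2e}")  # very small steps, very many iterations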

11 / 29

Metropolis-Adjusted Langevin Algorithm

Due to the discretization, in general the distribution of $X^{(k)}$ generated by ULA would NOT converge to $\pi$.

However, ULA provides a good proposal distribution for the Metropolis-Hastings algorithm.

Random walk Metropolis:

  • Propose $Y^{(k+1)} \sim N(X^{(k)}, \sigma^2 I)$

  • Define $\alpha = \min\{1, \pi(Y^{(k+1)}) / \pi(X^{(k)})\}$

  • Set $X^{(k+1)} = Y^{(k+1)}$ with probability $\alpha$; otherwise $X^{(k+1)} = X^{(k)}$
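
As a baseline for comparison, here is a minimal Python sketch of random walk Metropolis for the same toy Gaussian target as before (log_pi and rwm are our own names); working on the log scale avoids numerical overflow in the density ratio.

import numpy as np

def log_pi(x):
    # Unnormalized log-density of the standard Gaussian target
    return -0.5 * np.sum(x**2)

def rwm(x0, sigma, n_steps, rng):
    # Random walk Metropolis with proposal N(x, sigma^2 I)
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        y = x + sigma * rng.standard_normal(x.size)
        # Accept with probability min{1, pi(y) / pi(x)}, computed on the log scale
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x = y
        samples[k] = x
    return samples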

12 / 29

Metropolis-Adjusted Langevin Algorithm

Metropolis-Adjusted Langevin Algorithm (MALA):

  • Propose $Y^{(k+1)} \sim N(X^{(k)} - \tau \nabla V(X^{(k)}),\, 2\tau I)$

  • Define $\alpha = \min\left\{1, \dfrac{\exp\left[-V(Y^{(k+1)}) - \|X^{(k)} - Y^{(k+1)} + \tau \nabla V(Y^{(k+1)})\|^2 / (4\tau)\right]}{\exp\left[-V(X^{(k)}) - \|Y^{(k+1)} - X^{(k)} + \tau \nabla V(X^{(k)})\|^2 / (4\tau)\right]}\right\}$

  • Set $X^{(k+1)} = Y^{(k+1)}$ with probability $\alpha$; otherwise $X^{(k+1)} = X^{(k)}$
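
A matching Python sketch of MALA for the toy Gaussian target (all names are ours). The acceptance ratio must include the proposal densities because the Langevin proposal is not symmetric:

import numpy as np

def V(x):
    return 0.5 * np.sum(x**2)  # toy potential: standard Gaussian target

def grad_V(x):
    return x

def log_q(y, x, tau):
    # Log-density (up to constants) of the proposal N(x - tau * grad_V(x), 2 * tau * I)
    return -np.sum((y - x + tau * grad_V(x))**2) / (4 * tau)

def mala(x0, tau, n_steps, rng):
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        y = x - tau * grad_V(x) + np.sqrt(2 * tau) * rng.standard_normal(x.size)
        # log alpha = [-V(y) + log q(x | y)] - [-V(x) + log q(y | x)]
        log_alpha = -V(y) + log_q(x, y, tau) + V(x) - log_q(y, x, tau)
        if np.log(rng.uniform()) < log_alpha:
            x = y
        samples[k] = x
    return samples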

13 / 29

Convergence Rate of MALA [4]

For smooth and strongly convex $V(\cdot)$, [4] shows that MALA has advantages over ULA and random walk Metropolis.

Roughly speaking, to obtain samples with total variation error at most $\varepsilon$, ULA requires $O(\kappa^2 d / \varepsilon^2)$ steps from a warm start, whereas MALA only requires $O(\kappa d \log(1/\varepsilon))$. Here $\kappa = L/m$ is the condition number.

As a comparison, the random walk Metropolis requires $O(\kappa^2 d \log(1/\varepsilon))$ steps.

14 / 29

Underdamped Langevin Diffusion

Another variant of Langevin diffusion, called underdamped Langevin diffusion, solves the following SDE:

$$X_0 = x_0, \quad Z_0 = z_0,$$
$$dX_t = Z_t\,dt,$$
$$dZ_t = -\gamma Z_t\,dt - \nabla V(X_t)\,dt + \sqrt{2\gamma}\,dB_t, \quad t \ge 0.$$

Here we introduce an auxiliary process $\{Z_t\}$. Interestingly, $(X_t, Z_t)$ has a joint invariant distribution of the form

$$\pi(x, z) \propto \exp\{-V(x) - \|z\|^2 / 2\} \propto \pi(x) \exp(-\|z\|^2 / 2),$$

where $\pi(x) \propto \exp\{-V(x)\}$ is the target distribution.

15 / 29

Underdamped Langevin Diffusion

This means that if the distribution of $(X_t, Z_t)$ converges to the invariant distribution, then $X_t$ eventually follows $\pi(x)$, and $Z_t$ converges to a normal distribution $N(0, I_d)$ independent of $X_t$.

The discretized version is

$$X^{(k+1)} = X^{(k)} + \tau Z^{(k)},$$
$$Z^{(k+1)} = (1 - \gamma\tau) Z^{(k)} - \tau \nabla V(X^{(k)}) + \sqrt{2\gamma\tau}\,\xi^{(k+1)}, \quad k = 0, 1, \ldots$$

However, for the underdamped Langevin diffusion we do not have an obvious Metropolis correction method.
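
Still, the discretized iteration can be simulated directly. A minimal Python sketch for the toy Gaussian target (names and the values of $\tau$, $\gamma$ are our own illustrative choices):

import numpy as np

def grad_V(x):
    return x  # toy potential V(x) = ||x||^2 / 2

def underdamped_langevin(x0, z0, tau, gamma, n_steps, rng):
    # Euler discretization of the underdamped Langevin SDE
    x, z = np.array(x0, dtype=float), np.array(z0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x_new = x + tau * z
        z = (1 - gamma * tau) * z - tau * grad_V(x) \
            + np.sqrt(2 * gamma * tau) * rng.standard_normal(x.size)
        x = x_new
        samples[k] = x
    return samples

rng = np.random.default_rng(0)
draws = underdamped_langevin(np.zeros(2), np.zeros(2), tau=0.01, gamma=2.0,
                             n_steps=50000, rng=rng)
print(draws.mean(axis=0), draws.var(axis=0))  # close to (0, 0) and (1, 1)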

16 / 29

Hamiltonian Monte Carlo

17 / 29

Hamiltonian Dynamics

  • Imagine a particle with position $x \in \mathbb{R}^d$, velocity $v \in \mathbb{R}^d$, and unit mass

  • Define the total energy as $H(x, v) = U(x) + \frac{1}{2}\|v\|^2$

  • $H(x, v)$ is also called the Hamiltonian of the particle

18 / 29

Hamiltonian Dynamics

  • The Hamiltonian dynamics are a set of differential equations: $\dfrac{dx}{dt} = \dfrac{\partial H(x, v)}{\partial v}, \quad \dfrac{dv}{dt} = -\dfrac{\partial H(x, v)}{\partial x}$

  • We immediately find that $\dfrac{dH}{dt} = \dfrac{\partial H}{\partial x} \cdot \dfrac{dx}{dt} + \dfrac{\partial H}{\partial v} \cdot \dfrac{dv}{dt} = \dfrac{\partial H}{\partial x} \cdot \dfrac{\partial H}{\partial v} - \dfrac{\partial H}{\partial v} \cdot \dfrac{\partial H}{\partial x} = 0$, i.e., the Hamiltonian is conserved

  • With the given form $H(x, v) = U(x) + \frac{1}{2}\|v\|^2$, we have $\dfrac{dx}{dt} = v, \quad \dfrac{dv}{dt} = -\nabla U(x)$

19 / 29

Hamiltonian Dynamics

  • Starting from an initial value $(x, v)$, denote the solution to the equations by $\varphi_t(x, v) = (x_t(x, v), v_t(x, v))$

  • Properties

    • $\varphi_t$ is a deterministic operator, i.e., no random variables are involved so far

    • Reversibility: $\varphi_t(x_t(x, v), -v_t(x, v)) = (x, -v)$, i.e., negating the velocity and running the dynamics again brings the particle back

    • Energy conservation: $H(\varphi_t(x, v)) = H(x, v)$

20 / 29

Stationarity Theorem [5]

Suppose $(X, V) \sim \pi(x, v) \propto e^{-H(x, v)}$. Then for any $T \ge 0$, $\varphi_T(X, V) \sim \pi(x, v)$.

  • $(X, V) \sim \pi(x, v)$ implies that $X$ and $V$ are independent, with $X \sim \pi(x) \propto e^{-U(x)}$ and $V \sim N(0, I_d)$

  • The theorem indicates that the operator $\varphi_T$ does not change the joint distribution of the random vector $(X, V)$

  • Marginally, $\varphi_T$ does not change the distribution of $X$ when $V \sim N(0, I_d)$ is drawn independently of $X$

21 / 29

Idealized HMC

If we define a Markov chain with the following transition algorithm from $X^{(k)}$:

  1. Simulate $V \sim N(0, I_d)$ independent of $X^{(k)}$

  2. Let $(X^{(k+1)}, V') = \varphi_T(X^{(k)}, V)$

Then this transition kernel $T(x^{(k)} \to x^{(k+1)})$ satisfies $\pi(y) = \int \pi(x) T(x \to y)\,dx$, where $\pi(x) \propto e^{-U(x)}$.

22 / 29

Idealized HMC

For $k = 1, \ldots, K$:

  1. Sample $\xi \sim N(0, I_d)$

  2. Set $(X^{(k+1)}, \xi') = \varphi_T(X^{(k)}, \xi)$

Then under mild conditions on $U(x)$ we obtain the ergodicity of $\{X^{(k)}\}$.

Note that $\varphi_T$ is essentially determined by $U(x)$, and here we assume that the differential equations can be solved exactly.
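
The idealized algorithm can only be run when $\varphi_T$ has a closed form. One such case is $U(x) = \|x\|^2/2$: the equations $dx/dt = v$, $dv/dt = -x$ describe a harmonic oscillator, whose exact flow is the rotation $x_t = x \cos t + v \sin t$, $v_t = v \cos t - x \sin t$. A minimal Python sketch built on this exact flow (our own construction for illustration):

import numpy as np

def exact_flow(x, v, T):
    # Exact Hamiltonian flow for U(x) = ||x||^2 / 2: a rotation in (x, v) phase space
    return x * np.cos(T) + v * np.sin(T), v * np.cos(T) - x * np.sin(T)

def idealized_hmc(x0, T, n_steps, rng):
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        v = rng.standard_normal(x.size)  # fresh velocity, independent of x
        x, _ = exact_flow(x, v, T)       # follow the exact dynamics; discard velocity
        samples[k] = x
    return samples

rng = np.random.default_rng(0)
draws = idealized_hmc(np.full(2, 5.0), T=1.0, n_steps=5000, rng=rng)
print(draws.mean(axis=0), draws.var(axis=0))  # approach (0, 0) and (1, 1)

No accept-reject step is needed here: the exact flow conserves $H$, so every proposal would be accepted anyway.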

23 / 29

Convergence Rate [6]

Assume that $U: \mathbb{R}^d \to \mathbb{R}$ is $m$-strongly-convex and $L$-smooth. Let $\nu_k$ be the marginal distribution of $X^{(k)}$ from the idealized HMC algorithm. Then for any $0 < \varepsilon < \sqrt{d}$, with $X^{(0)} = \arg\min_x U(x)$, $T = \Omega(1/\sqrt{L})$, and $k = O\left(\frac{L}{m} \log(d/\varepsilon)\right)$, we have $W_2(\nu_k, \pi) \le \varepsilon/\sqrt{m}$, where $W_2(\cdot, \cdot)$ is the 2-Wasserstein distance between two distributions.

24 / 29

Some Remarks

The previous theorem means that if $U(\cdot)$ is both smooth and strongly convex, then the distribution of $X^{(k)}$ converges to $\pi$ exponentially fast.

At first glance this is based on the same assumptions as the Langevin algorithm, and the result is also similar.

However, empirical results show that HMC tends to perform better for more complicated distributions, e.g., multimodal distributions.

25 / 29

Discretized HMC

  • In reality, of course, we cannot solve the differential equations exactly

  • Need to discretize the solution

  • Let $q^{(0)} = X^{(k)}$, $p^{(0)} \sim N(0, I_d)$. For $l = 0, \ldots, L-1$:
    $$p^{(l+1/2)} = p^{(l)} - \tfrac{\tau}{2} \nabla U(q^{(l)})$$
    $$q^{(l+1)} = q^{(l)} + \tau p^{(l+1/2)}$$
    $$p^{(l+1)} = p^{(l+1/2)} - \tfrac{\tau}{2} \nabla U(q^{(l+1)})$$

  • Propose $Y = q^{(L)}$ as the next state

  • $\tau$ and $L$ are tuning parameters

26 / 29

Discretized HMC

  • But now we lose the preservation of the Hamiltonian

  • Need a Metropolis-Hastings correction!

  • Acceptance probability $\alpha = \min\left(1, \dfrac{\exp[-H(Y, \tilde{V})]}{\exp[-H(X^{(k)}, V)]}\right), \quad V = p^{(0)}, \quad \tilde{V} = p^{(L)}$

  • With probability $\alpha$ set $X^{(k+1)} = Y$; otherwise set $X^{(k+1)} = X^{(k)}$
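
Putting the leapfrog integrator and the correction together, here is a minimal Python sketch of one HMC transition for the toy Gaussian target (all names are ours; n_leapfrog plays the role of $L$ on the previous slide):

import numpy as np

def U(x):
    return 0.5 * np.sum(x**2)  # toy potential: standard Gaussian target

def grad_U(x):
    return x

def hmc_step(x, tau, n_leapfrog, rng):
    # One HMC transition: leapfrog integration + Metropolis-Hastings correction
    v = rng.standard_normal(x.size)
    q, p = x.copy(), v.copy()
    for _ in range(n_leapfrog):
        p = p - 0.5 * tau * grad_U(q)  # half step for momentum
        q = q + tau * p                # full step for position
        p = p - 0.5 * tau * grad_U(q)  # half step for momentum
    # H(x, v) = U(x) + ||v||^2 / 2; accept with probability min(1, exp(H_old - H_new))
    h_old = U(x) + 0.5 * np.sum(v**2)
    h_new = U(q) + 0.5 * np.sum(p**2)
    if np.log(rng.uniform()) < h_old - h_new:
        return q
    return x

rng = np.random.default_rng(0)
x = np.zeros(2)
draws = np.empty((3000, 2))
for k in range(3000):
    x = hmc_step(x, tau=0.2, n_leapfrog=10, rng=rng)
    draws[k] = x
print(draws.mean(axis=0), draws.var(axis=0))  # close to (0, 0) and (1, 1)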

27 / 29

References

[1] Gareth O. Roberts and Richard L. Tweedie (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli.

[2] Arnak S. Dalalyan (2017). Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

[3] Xiang Cheng and Peter Bartlett (2018). Convergence of Langevin MCMC in KL-divergence. Algorithmic Learning Theory.

[4] Raaz Dwivedi et al. (2019). Log-concave sampling: Metropolis-Hastings algorithms are fast. Journal of Machine Learning Research.

28 / 29

References

[5] Nisheeth K. Vishnoi (2021). An introduction to Hamiltonian Monte Carlo method for sampling. arXiv:2108.12107.

[6] Zongchen Chen and Santosh S. Vempala (2022). Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. Theory of Computing.

29 / 29
