
Computational Statistics

Lecture 13

Yixuan Qiu

2023-12-13

1 / 29

Simulation and Sampling

2 / 29

Today's Topics

  • Langevin algorithm

  • Hamiltonian Monte Carlo

3 / 29

Langevin Algorithm

4 / 29

Langevin Dynamics

The Langevin dynamics, or Langevin diffusion process, is the solution to the following stochastic differential equation (SDE) on \mathbb{R}^{d}:

\mathrm{d}X_{t}=-\nabla V(X_{t})\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t},\quad X_{0}\sim p_{0}(x),

where \nabla V(\cdot) is the gradient of some smooth function V:\mathbb{R}^{d}\rightarrow\mathbb{R}, and B_{t} is the d-dimensional Brownian motion.

5 / 29

Langevin Dynamics

The SDE can be understood as the continuous-time limit of the following iteration:

\tau^{-1}(X_{t+\tau}-X_{t})=-\nabla V(X_{t})+\sqrt{2}\tau^{-1}(B_{t+\tau}-B_{t}).

It is known that B_{t+\tau}-B_{t}\sim N(0,\tau I), so we can rewrite the formula above as

X_{t+\tau}-X_{t}=-\tau\nabla V(X_{t})+\sqrt{2\tau}\xi_{t+\tau},

where \xi_{t+\tau}\sim N(0,I) and is independent of X_{t}.

It is very similar to the gradient descent iteration, except for the additional noise term.

6 / 29

Invariant Distribution

An important property of the Langevin dynamics is that under mild conditions, it has an explicit invariant distribution. Theorem 2.1 of [1]:

Suppose that V(\cdot) is continuously differentiable, and there exist some constants a,b,C<\infty such that

\langle\nabla V(x),x\rangle\ge a\Vert x\Vert^{2}+b,\quad\Vert x\Vert>C.

Let p_{t}(x) be the density function of X_{t}. Then \Vert p_{t}-\pi\Vert_{\mathrm{TV}}\rightarrow 0, where \Vert f-g\Vert_{\mathrm{TV}} is the total variation distance between two densities f and g, and \pi(x)\propto\exp\{-V(x)\}.

7 / 29

Invariant Distribution

This means that no matter what the initial distribution p_{0}(x) is, X_{t} will eventually follow the distribution \pi(x)\propto\exp\{-V(x)\}.

In other words, if we are interested in sampling from some target distribution \pi(x)\propto\exp\{-V(x)\}, we can simulate the Langevin diffusion process of some particles, starting from an arbitrary initial distribution. For example, the standard normal target N(0,I_{d}) corresponds to V(x)=\Vert x\Vert^{2}/2.

This is the foundation of the Langevin Monte Carlo algorithm.

8 / 29

Convergence Rate [2]

In addition, if V(\cdot) is m-strongly-convex and L-smooth, then

\Vert p_{t}-\pi\Vert_{\mathrm{TV}}\le\frac{1}{2}\left\{\chi^{2}(p_{0}\Vert\pi)\right\}^{1/2}e^{-mt/2},

where \chi^{2}(f\Vert g)=\int\left(f(x)/g(x)-1\right)^{2}g(x)\mathrm{d}x is the \chi^{2} divergence between two densities f and g.

This means that if V(\cdot) is both smooth and strongly convex, then the distribution of X_{t} converges to \pi exponentially fast.

9 / 29

Unadjusted Langevin Algorithm

A natural way to simulate the SDE is to use the discretized iteration:

\begin{align*} X^{(0)} & \sim p_{0}(x),\quad\xi^{(k)}\overset{iid}{\sim}N(0,I_{d}),\\ X^{(k+1)} & =X^{(k)}-\tau\nabla V(X^{(k)})+\sqrt{2\tau}\xi^{(k+1)},\quad k=0,1,\ldots \end{align*}

This is called the unadjusted Langevin algorithm (ULA).
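
As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of ULA; the bivariate Gaussian target, the step size \tau, and the chain length are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: pi(x) ∝ exp{-V(x)} with V(x) = x' Sigma^{-1} x / 2,
# i.e., a N(0, Sigma) distribution, so that grad V(x) = Sigma^{-1} x.
Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

def grad_V(x):
    return Sigma_inv @ x

def ula(x0, tau=1e-2, n_iter=10_000):
    x = x0.copy()
    chain = np.empty((n_iter, x0.size))
    for k in range(n_iter):
        xi = rng.standard_normal(x0.size)
        # X^{(k+1)} = X^{(k)} - tau * grad V(X^{(k)}) + sqrt(2 tau) * xi
        x = x - tau * grad_V(x) + np.sqrt(2.0 * tau) * xi
        chain[k] = x
    return chain

samples = ula(np.zeros(2))[5000:]   # discard burn-in
print(np.cov(samples.T))            # roughly Sigma, up to discretization bias
```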

10 / 29

Convergence of ULA [3]

Let p_{k}^{\tau} denote the distribution of X^{(k)} with step size \tau. Suppose that V(\cdot) is m-strongly-convex and L-smooth, and set p_{0}=N(0,m^{-1}I). Then with

\tau=\frac{m\varepsilon}{16dL^{2}},\quad k=\frac{16L^{2}}{m^{2}}\cdot\frac{d\log\left(Ld/(m\varepsilon)\right)}{\varepsilon},

we have

\mathrm{KL}(p_{k}^{\tau}\Vert\pi)\le\varepsilon,\quad\Vert p_{k}^{\tau}-\pi\Vert_{\mathrm{TV}}\le\sqrt{\varepsilon},\quad W_{2}(p_{k}^{\tau},\pi)\le\sqrt{2\varepsilon/m},

where \mathrm{KL}(\cdot\Vert\cdot) is the KL divergence, and W_{2}(\cdot,\cdot) is the 2-Wasserstein distance.

11 / 29

Metropolis-Adjusted Langevin Algorithm

Due to the discretization, in general the distribution of X^{(k)} generated by ULA would NOT converge to \pi.

However, ULA provides a good proposal distribution for the Metropolis-Hastings algorithm.

Random walk Metropolis (a code sketch follows the steps):

  • Propose Y^{(k+1)}\sim N(X^{(k)},\sigma^{2}I)

  • Define \alpha=\min\left\{ 1,\frac{\exp\left[-V(Y^{(k+1)})\right]}{\exp\left[-V(X^{(k)})\right]}\right\}

  • Set X^{(k+1)}=Y^{(k+1)} with probability \alpha, otherwise X^{(k+1)}=X^{(k)}
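
A minimal sketch of random walk Metropolis (not from the lecture); the standard normal target with V(x)=\Vert x\Vert^{2}/2 and the proposal scale \sigma are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def V(x):                       # potential of a standard normal target
    return 0.5 * x @ x

def rwm(x0, sigma=0.5, n_iter=10_000):
    x = x0.copy()
    chain = np.empty((n_iter, x0.size))
    for k in range(n_iter):
        y = x + sigma * rng.standard_normal(x0.size)   # symmetric proposal
        # alpha = min{1, exp[-V(y)] / exp[-V(x)]}, computed on the log scale
        if np.log(rng.uniform()) < V(x) - V(y):
            x = y
        chain[k] = x
    return chain

print(rwm(np.zeros(2))[5000:].std(axis=0))   # roughly 1 for N(0, I)
```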

12 / 29

Metropolis-Adjusted Langevin Algorithm

Metropolis-Adjusted Langevin Algorithm (MALA; a code sketch follows the steps):

  • Propose Y^{(k+1)}\sim N(X^{(k)}-\tau\nabla V(X^{(k)}),2\tau I)

  • Define \small\alpha=\min\left\{ 1,\frac{\exp\left[-V(Y^{(k+1)})-\Vert X^{(k)}-Y^{(k+1)}+\tau\nabla V(Y^{(k+1)})\Vert^{2}/(4\tau)\right]}{\exp\left[-V(X^{(k)})-\Vert Y^{(k+1)}-X^{(k)}+\tau\nabla V(X^{(k)})\Vert^{2}/(4\tau)\right]}\right\}

  • Set X^{(k+1)}=Y^{(k+1)} with probability \alpha, otherwise X^{(k+1)}=X^{(k)}
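
Again purely as an illustration, here is a minimal sketch of MALA; the standard normal target (V and grad_V) and the step size \tau are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

def V(x):
    return 0.5 * x @ x          # standard normal target

def grad_V(x):
    return x

def mala(x0, tau=0.1, n_iter=10_000):
    x = x0.copy()
    chain = np.empty((n_iter, x0.size))
    for k in range(n_iter):
        xi = rng.standard_normal(x0.size)
        y = x - tau * grad_V(x) + np.sqrt(2.0 * tau) * xi   # Langevin proposal
        # log acceptance ratio, including the asymmetric proposal correction
        log_num = -V(y) - np.sum((x - y + tau * grad_V(y)) ** 2) / (4.0 * tau)
        log_den = -V(x) - np.sum((y - x + tau * grad_V(x)) ** 2) / (4.0 * tau)
        if np.log(rng.uniform()) < log_num - log_den:
            x = y
        chain[k] = x
    return chain

print(mala(np.zeros(2))[5000:].std(axis=0))   # roughly 1 for N(0, I)
```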

13 / 29

Convergence Rate of MALA [4]

For smooth and strongly convex V(\cdot), [4] shows that MALA has advantages over ULA and random walk Metropolis.

Roughly speaking, to obtain samples with total variation error at most \varepsilon, ULA requires O(\kappa^2 d/\varepsilon^2) steps from a warm start, whereas MALA only requires O(\kappa d\log(1/\varepsilon)). Here \kappa=L/m is the condition number.

As a comparison, the random walk Metropolis requires O(\kappa^2 d\log(1/\varepsilon)) steps.

14 / 29

Underdamped Langevin Diffusion

Another variant of Langevin diffusion, called underdamped Langevin diffusion, solves the following SDE:

\begin{align*} X_{0} & =x_{0},\quad Z_{0}=z_{0},\\ \mathrm{d}X_{t} & =Z_{t}\mathrm{d}t,\\ \mathrm{d}Z_{t} & =-\gamma Z_{t}\mathrm{d}t-\nabla V(X_{t})\mathrm{d}t+\sqrt{2\gamma}\mathrm{d}B_{t},\quad t\ge0. \end{align*}

Here we introduce an auxiliary process \{Z_t\}. Interestingly, (X_t,Z_t) has a joint invariant distribution of the form

\pi(x,z)\propto \exp\{-V(x)-\Vert z\Vert^2/2\}\propto \pi(x)\cdot\exp(-\Vert z\Vert^2/2),

where \pi(x)\propto \exp\{-V(x)\} is the target distribution.

15 / 29

Underdamped Langevin Diffusion

This means that if the distribution of (X_t,Z_t) converges to the invariant distribution, then X_t eventually follows \pi(x), and Z_t converges to a normal distribution N(0,I_d) independent of X.

The discretized version is

\begin{align*} X^{(k+1)} & =X^{(k)}+\tau Z^{(k)},\\ Z^{(k+1)} & =(1-\gamma\tau)Z^{(k)}-\tau\nabla V(X^{(k)})+\sqrt{2\gamma\tau}\xi^{(k+1)},\quad k=0,1,\ldots \end{align*}


However, for the underdamped Langevin diffusion we do not have an obvious Metropolis correction method.
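
For completeness, a minimal sketch of the discretized iteration above; the standard normal target, the friction \gamma, and the step size \tau are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def grad_V(x):
    return x                    # gradient of V(x) = ||x||^2 / 2

def underdamped(x0, z0, gamma=2.0, tau=1e-2, n_iter=20_000):
    x, z = x0.copy(), z0.copy()
    chain = np.empty((n_iter, x0.size))
    for k in range(n_iter):
        xi = rng.standard_normal(x0.size)
        g = grad_V(x)           # grad V evaluated at X^{(k)}
        x = x + tau * z         # X^{(k+1)} = X^{(k)} + tau * Z^{(k)}
        z = (1.0 - gamma * tau) * z - tau * g + np.sqrt(2.0 * gamma * tau) * xi
        chain[k] = x
    return chain

print(underdamped(np.zeros(2), np.zeros(2))[10_000:].std(axis=0))  # roughly 1
```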

16 / 29

Hamiltonian Monte Carlo

17 / 29

Hamiltonian Dynamics

  • Imagine a particle with position x\in\mathbb{R}^d, velocity v\in\mathbb{R}^d, and unit mass

  • Define the total energy as H(x,v)=U(x)+\frac{1}{2}\Vert v \Vert^2

  • Also called the Hamiltonian of the particle

18 / 29

Hamiltonian Dynamics

  • The Hamiltonian dynamics are a set of differential equations: \frac{\mathrm{d}x}{\mathrm{d}t}=\frac{\partial H(x,v)}{\partial v},\quad\frac{\mathrm{d}v}{\mathrm{d}t}=-\frac{\partial H(x,v)}{\partial x}

  • We immediately find that \mathrm{d}H/\mathrm{d}t=0 (see the computation after this list)

  • With the given form H(x,v)=U(x)+\frac{1}{2}\Vert v \Vert^2, we have \frac{\mathrm{d}x}{\mathrm{d}t}=v,\quad\frac{\mathrm{d}v}{\mathrm{d}t}=-\nabla U(x)
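
Filling in the one-line computation behind the second bullet: along a trajectory, the chain rule and the two equations give

\frac{\mathrm{d}H}{\mathrm{d}t}=\left\langle \frac{\partial H}{\partial x},\frac{\mathrm{d}x}{\mathrm{d}t}\right\rangle +\left\langle \frac{\partial H}{\partial v},\frac{\mathrm{d}v}{\mathrm{d}t}\right\rangle =\left\langle \frac{\partial H}{\partial x},\frac{\partial H}{\partial v}\right\rangle -\left\langle \frac{\partial H}{\partial v},\frac{\partial H}{\partial x}\right\rangle =0.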

19 / 29

Hamiltonian Dynamics

  • Starting from an initial value (x,v), denote the solution to the equations by \varphi_t(x,v)=\left(x_t(x,v),v_t(x,v)\right)

  • Properties

    • \varphi_t is a deterministic operator, i.e., no random variables involved so far

    • \varphi_t(x_t(x,v),-v_t(x,v))=\left(x,-v\right)

    • H(\varphi_t(x,v))=H(x,v)

20 / 29

Stationarity Theorem [5]

Suppose (X,V)\sim\pi(x,v)\propto e^{-H(x,v)}. Then for any T\ge 0, \varphi_T(X,V)\sim\pi(x,v).


  • (X,V)\sim\pi(x,v) implies that X and V are independent with X\sim\pi(x)\propto e^{-U(x)} and V\sim N(0,I_d)

  • The theorem indicates that the operator \varphi_T does not change the distribution of the random vectors (X,V)

  • Marginally, \varphi_T does not change the distribution of X when V\sim N(0,I_d) is drawn independently of X

21 / 29

Idealized HMC

If we define a Markov chain with the following transition algorithm from X^{(k)}:

  1. Simulate V\sim N(0,I_d) independent of X^{(k)}

  2. Let (X^{(k+1)},V')=\varphi_T(X^{(k)},V)

Then this transition kernel \mathcal{T}(x^{(k)}\rightarrow x^{(k+1)}) satisfies \pi(y)=\int\pi(x)\mathcal{T}(x\rightarrow y)\mathrm{d}x, where \pi(x)\propto e^{-U(x)}.

22 / 29

Idealized HMC

For k=1,...,K:

  1. Sample \xi\sim N(0,I_d)

  2. Set (X^{(k+1)},\xi')=\varphi_T(X^{(k)},\xi)

Then under mild conditions on U(x) we obtain the ergodicity of \{X^{(k)}\}.

Note that \varphi_T is essentially determined by \nabla U(x), and here we assume that the differential equations can be solved exactly.

23 / 29

Convergence Rate [6]

Assume that U:\mathbb{R}^d\rightarrow\mathbb{R} is m-strongly-convex and L-smooth. Let \nu_k be the marginal distribution of X^{(k)} from the idealized HMC algorithm. Then for any 0<\varepsilon<\sqrt{d} with X^{(0)}=\arg\min_x U(x), T=\Omega\left(\frac{1}{\sqrt{L}}\right), and k=O\left(\frac{L}{m}\log(d/\varepsilon)\right), we have W_2(\nu_k,\pi)\le\frac{\varepsilon}{\sqrt{m}}, where W_2(\cdot,\cdot) is the 2-Wasserstein distance between two distributions.

24 / 29

Some Remarks

The previous theorem means that if U(\cdot) is both smooth and strongly convex, then the distribution of X^{(k)} converges to \pi exponentially fast.

At first glance it relies on the same assumptions as the Langevin algorithm, and the result is also similar.

However, empirical results show that HMC tends to perform better for more complicated distributions, e.g., multimodal distributions.

25 / 29

Discretized HMC

  • In reality, of course, we cannot solve the differential equations exactly

  • Need to discretize the solution

  • Let q(0)=X^{(k)}, p(0)\sim N(0,I_d). For l=0,\ldots,L-1, \begin{align*} p(l+1/2) & =p(l)-\frac{\tau}{2}\nabla U(q(l))\\ q(l+1) & =q(l)+\tau p(l+1/2)\\ p(l+1) & =p(l+1/2)-\frac{\tau}{2}\nabla U(q(l+1)) \end{align*}

  • Propose Y=q(L) as the next state

  • \tau and L are tuning parameters

26 / 29

Discretized HMC

  • But now we lose the preservation of the Hamiltonian

  • Need a Metropolis-Hastings correction!

  • Acceptance probability \alpha=\min\left(1,\frac{\exp\left[-H(Y,\tilde{V})\right]}{\exp\left[-H(X^{(k)},V)\right]}\right),\quad V=p(0),\quad\tilde{V}=p(L)

  • With probability \alpha set X^{(k+1)}=Y; otherwise set X^{(k+1)}=X^{(k)} (a code sketch of the full procedure follows)
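
Putting the last two slides together, here is a minimal sketch of discretized HMC (leapfrog plus the Metropolis-Hastings correction); the standard normal target with U(x)=\Vert x\Vert^{2}/2, the step size \tau, and the number of leapfrog steps L are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def U(x):
    return 0.5 * x @ x          # standard normal target

def grad_U(x):
    return x

def hmc(x0, tau=0.2, L=10, n_iter=5_000):
    x = x0.copy()
    chain = np.empty((n_iter, x0.size))
    for k in range(n_iter):
        v = rng.standard_normal(x0.size)   # fresh momentum V ~ N(0, I)
        q, p = x.copy(), v.copy()
        for _ in range(L):                 # leapfrog steps
            p = p - 0.5 * tau * grad_U(q)
            q = q + tau * p
            p = p - 0.5 * tau * grad_U(q)
        # alpha = min{1, exp[-H(Y, p(L))] / exp[-H(X, p(0))]} on the log scale
        log_alpha = (U(x) + 0.5 * v @ v) - (U(q) + 0.5 * p @ p)
        if np.log(rng.uniform()) < log_alpha:
            x = q
        chain[k] = x
    return chain

print(hmc(np.zeros(2))[2500:].std(axis=0))   # roughly 1 for N(0, I)
```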

27 / 29

References

[1] Gareth O. Roberts and Richard L. Tweedie (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli.

[2] Arnak S. Dalalyan (2017). Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

[3] Xiang Cheng and Peter Bartlett (2018). Convergence of Langevin MCMC in KL-divergence. Algorithmic Learning Theory.

[4] Raaz Dwivedi et al. (2019). Log-concave sampling: Metropolis-Hastings algorithms are fast. Journal of Machine Learning Research.

28 / 29

References

[5] Nisheeth K. Vishnoi (2021). An introduction to Hamiltonian Monte Carlo method for sampling. arXiv:2108.12107.

[6] Zongchen Chen and Santosh S. Vempala (2022). Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. Theory of Computing.

29 / 29
