Langevin algorithm
Hamiltonian Monte Carlo
The Langevin dynamics, or Langevin diffusion process, is the solution to the following stochastic differential equation (SDE) on $\mathbb{R}^d$:
$$dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dB_t, \qquad X_0 \sim p_0(x),$$
where $\nabla V(\cdot)$ is the gradient of some smooth function $V: \mathbb{R}^d \to \mathbb{R}$, and $B_t$ is the $d$-dimensional Brownian motion.
The SDE can be understood as the continuous-time limit of the following iteration:
$$\tau^{-1}(X_{t+\tau} - X_t) = -\nabla V(X_t) + \sqrt{2}\,\tau^{-1}(B_{t+\tau} - B_t).$$
It is known that $B_{t+\tau} - B_t \sim N(0, \tau I)$, so we can rewrite the formula above as
$$X_{t+\tau} - X_t = -\tau \nabla V(X_t) + \sqrt{2\tau}\,\xi_{t+\tau},$$
where $\xi_{t+\tau} \sim N(0, I)$ is independent of $X_t$.
This is very similar to the gradient descent iteration, except for the additional Gaussian noise term.
An important property of the Langevin dynamics is that under mild conditions, it has an explicit invariant distribution. Theorem 2.1 of [1]:
Suppose that $\nabla V(\cdot)$ is continuously differentiable, and there exist some constants $a, b, C < \infty$ such that
$$\langle -\nabla V(x), x \rangle \le a\|x\|^2 + b, \qquad \forall\, \|x\| > C.$$
Let $p_t(x)$ be the density function of $X_t$. Then $\|p_t - \pi\|_{TV} \to 0$, where $\|f - g\|_{TV}$ is the total variation distance between two densities $f$ and $g$, and $\pi(x) \propto \exp\{-V(x)\}$.
This means that no matter what the initial distribution $p_0(x)$ is, $X_t$ will eventually follow the distribution $\pi(x) \propto \exp\{-V(x)\}$.
In other words, if we are interested in sampling from some target distribution $\pi(x) \propto \exp\{-V(x)\}$, we can simulate the Langevin diffusion process for some particles, starting from an arbitrary initial distribution.
This is the foundation of the Langevin Monte Carlo algorithm.
In addition, if $V(\cdot)$ is $m$-strongly-convex and $L$-smooth, then
$$\|p_t - \pi\|_{TV} \le \tfrac{1}{2}\,\chi^2(p_0 \,\|\, \pi)^{1/2}\, e^{-mt/2},$$
where $\chi^2(f \,\|\, g) = \int \left( f(x)/g(x) - 1 \right)^2 g(x)\,dx$ is the $\chi^2$ divergence between two densities $f$ and $g$.
This means that if $V(\cdot)$ is both smooth and strongly convex, then the distribution of $X_t$ converges to $\pi$ exponentially fast.
A natural method to simulate the SDE is to use the discretized iteration:
$$X^{(0)} \sim p_0(x), \qquad \xi^{(k)} \overset{\text{iid}}{\sim} N(0, I_d),$$
$$X^{(k+1)} = X^{(k)} - \tau \nabla V(X^{(k)}) + \sqrt{2\tau}\,\xi^{(k+1)}, \qquad k = 0, 1, \ldots$$
This is called the unadjusted Langevin algorithm (ULA).
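As an illustration, here is a minimal NumPy sketch of ULA; the interface, the Gaussian example target, and all parameter values are our own choices, not from the source:

```python
import numpy as np

def ula(grad_V, x0, tau, n_steps, rng=None):
    """Unadjusted Langevin algorithm:
    X^{(k+1)} = X^{(k)} - tau * grad_V(X^{(k)}) + sqrt(2 tau) * xi^{(k+1)}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        xi = rng.standard_normal(x.size)                # xi ~ N(0, I_d)
        x = x - tau * grad_V(x) + np.sqrt(2 * tau) * xi
        samples[k] = x
    return samples

# Example: pi(x) ∝ exp(-||x||^2 / 2), so V(x) = ||x||^2 / 2 and grad_V(x) = x.
samples = ula(grad_V=lambda x: x, x0=np.zeros(2), tau=0.01, n_steps=10_000)
print(samples[2000:].mean(axis=0), samples[2000:].std(axis=0))  # roughly 0 and 1
```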
Let $p_{k\tau}$ denote the distribution of $X^{(k)}$ with step size $\tau$ (cf. [3]). Suppose that $V(\cdot)$ is $m$-strongly-convex and $L$-smooth, and set $p_0 = N(0, m^{-1}I)$. Then with
$$\tau = \frac{m\varepsilon}{16 d L^2}, \qquad k = \frac{16 L^2}{m^2} \cdot \frac{d \log\left(Ld/(m\varepsilon)\right)}{\varepsilon},$$
we have
$$\mathrm{KL}(p_{k\tau} \,\|\, \pi) \le \varepsilon, \qquad \|p_{k\tau} - \pi\|_{TV} \le \sqrt{\varepsilon}, \qquad W_2(p_{k\tau}, \pi) \le \sqrt{2\varepsilon/m},$$
where $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the KL divergence, and $W_2(\cdot, \cdot)$ is the 2-Wasserstein distance.
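To get a feel for these rates, a quick plug-in of illustrative values (the values $m = 1$, $L = 2$, $d = 10$, $\varepsilon = 0.01$ are our own, not from the source):

```python
import numpy as np

# Illustrative values (our own choice): m = 1, L = 2, d = 10, eps = 0.01.
m, L, d, eps = 1.0, 2.0, 10, 0.01
tau = m * eps / (16 * d * L**2)                              # step size from the theorem
k = (16 * L**2 / m**2) * d * np.log(L * d / (m * eps)) / eps # iteration count
print(f"step size tau = {tau:.2e}, iterations k = {k:.2e}")
# The step size shrinks like eps/d, while the iteration count grows like d log(1/eps) / eps.
```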
Due to the discretization, in general the distribution of $X^{(k)}$ generated by ULA would NOT converge to $\pi$.
However, ULA provides a good proposal distribution for the Metropolis-Hastings algorithm.
Random walk Metropolis:
Propose $Y^{(k+1)} \sim N(X^{(k)}, \sigma^2 I)$
Define $\alpha = \min\{1, \pi(Y^{(k+1)}) / \pi(X^{(k)})\}$
Set $X^{(k+1)} = Y^{(k+1)}$ with probability $\alpha$; otherwise $X^{(k+1)} = X^{(k)}$
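A minimal sketch of random walk Metropolis, working in log space for numerical stability (the interface and names are our own choices):

```python
import numpy as np

def rwm(V, x0, sigma, n_steps, rng=None):
    """Random walk Metropolis for pi(x) ∝ exp(-V(x)), computed in log space."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        y = x + sigma * rng.standard_normal(x.size)  # Y ~ N(x, sigma^2 I)
        log_alpha = V(x) - V(y)                      # log[pi(y) / pi(x)]
        if np.log(rng.uniform()) < log_alpha:        # accept with prob min(1, alpha)
            x = y
        samples[k] = x
    return samples
```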
Metropolis-Adjusted Langevin Algorithm (MALA):
Propose $Y^{(k+1)} \sim N(X^{(k)} - \tau \nabla V(X^{(k)}), 2\tau I)$
Define
$$\alpha = \min\left\{1,\ \frac{\exp\left[-V(Y^{(k+1)}) - \|X^{(k)} - Y^{(k+1)} + \tau \nabla V(Y^{(k+1)})\|^2 / (4\tau)\right]}{\exp\left[-V(X^{(k)}) - \|Y^{(k+1)} - X^{(k)} + \tau \nabla V(X^{(k)})\|^2 / (4\tau)\right]}\right\}$$
Set $X^{(k+1)} = Y^{(k+1)}$ with probability $\alpha$; otherwise $X^{(k+1)} = X^{(k)}$
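A corresponding MALA sketch; since the Langevin proposal is not symmetric, the acceptance ratio includes both the forward and backward proposal densities (the interface and names are our own choices):

```python
import numpy as np

def mala(V, grad_V, x0, tau, n_steps, rng=None):
    """Metropolis-adjusted Langevin algorithm for pi(x) ∝ exp(-V(x))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))

    def log_q(to, frm):
        # Log density (up to a constant) of the proposal N(frm - tau*grad_V(frm), 2*tau*I) at `to`.
        return -np.sum((to - frm + tau * grad_V(frm)) ** 2) / (4 * tau)

    for k in range(n_steps):
        y = x - tau * grad_V(x) + np.sqrt(2 * tau) * rng.standard_normal(x.size)
        # log of the MH ratio pi(y) q(x | y) / (pi(x) q(y | x))
        log_alpha = (-V(y) + log_q(x, y)) - (-V(x) + log_q(y, x))
        if np.log(rng.uniform()) < log_alpha:
            x = y
        samples[k] = x
    return samples
```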
For smooth and strongly convex $V(\cdot)$, [4] shows that MALA has advantages over ULA and random walk Metropolis.
Roughly speaking, to obtain samples with total variation error at most $\varepsilon$, ULA requires $O(\kappa^2 d / \varepsilon^2)$ steps from a warm start, whereas MALA only requires $O(\kappa d \log(1/\varepsilon))$. Here $\kappa = L/m$ is the condition number.
As a comparison, random walk Metropolis requires $O(\kappa^2 d \log(1/\varepsilon))$ steps.
Another variant of the Langevin diffusion, called the underdamped Langevin diffusion, solves the following SDE:
$$X_0 = x_0, \quad Z_0 = z_0, \qquad dX_t = Z_t\,dt, \qquad dZ_t = -\gamma Z_t\,dt - \nabla V(X_t)\,dt + \sqrt{2\gamma}\,dB_t, \qquad t \ge 0.$$
Here we introduce an auxiliary process $\{Z_t\}$. Interestingly, $(X_t, Z_t)$ has a joint invariant distribution of the form
$$\pi(x, z) \propto \exp\{-V(x) - \|z\|^2/2\} \propto \pi(x) \cdot \exp(-\|z\|^2/2),$$
where $\pi(x) \propto \exp\{-V(x)\}$ is the target distribution.
This means that if the distribution of $(X_t, Z_t)$ converges to the invariant distribution, then $X_t$ eventually follows $\pi(x)$, and $Z_t$ converges to a normal distribution $N(0, I_d)$ independent of $X_t$.
The discretized version is
$$X^{(k+1)} = X^{(k)} + \tau Z^{(k)}, \qquad Z^{(k+1)} = (1 - \gamma\tau) Z^{(k)} - \tau \nabla V(X^{(k)}) + \sqrt{2\gamma\tau}\,\xi^{(k+1)}, \qquad k = 0, 1, \ldots$$
However, for the underdamped Langevin diffusion we do not have an obvious Metropolis correction method.
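For completeness, a sketch of the (uncorrected) discretized iteration above; the interface and names are our own choices:

```python
import numpy as np

def underdamped_langevin(grad_V, x0, z0, tau, gamma, n_steps, rng=None):
    """Euler discretization of the underdamped Langevin diffusion (no MH correction)."""
    rng = np.random.default_rng() if rng is None else rng
    x, z = np.array(x0, dtype=float), np.array(z0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        xi = rng.standard_normal(x.size)
        # Both updates use the current (X^{(k)}, Z^{(k)}); tuple assignment keeps this exact.
        x, z = (x + tau * z,
                (1 - gamma * tau) * z - tau * grad_V(x) + np.sqrt(2 * gamma * tau) * xi)
        samples[k] = x
    return samples
```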
Imagine a particle with position $x \in \mathbb{R}^d$, velocity $v \in \mathbb{R}^d$, and unit mass
Define the total energy as $H(x, v) = U(x) + \frac{1}{2}\|v\|^2$
$H$ is also called the Hamiltonian of the particle
The Hamiltonian dynamics are a set of differential equations:
$$\frac{dx}{dt} = \frac{\partial H(x, v)}{\partial v}, \qquad \frac{dv}{dt} = -\frac{\partial H(x, v)}{\partial x}$$
We immediately find that $dH/dt = 0$: by the chain rule,
$$\frac{dH}{dt} = \left\langle \frac{\partial H}{\partial x}, \frac{dx}{dt} \right\rangle + \left\langle \frac{\partial H}{\partial v}, \frac{dv}{dt} \right\rangle = \left\langle \frac{\partial H}{\partial x}, \frac{\partial H}{\partial v} \right\rangle - \left\langle \frac{\partial H}{\partial v}, \frac{\partial H}{\partial x} \right\rangle = 0$$
With the given form $H(x, v) = U(x) + \frac{1}{2}\|v\|^2$, we have
$$\frac{dx}{dt} = v, \qquad \frac{dv}{dt} = -\nabla U(x)$$
Starting from an initial value $(x, v)$, denote the solution to the equations by $\varphi_t(x, v) = (x_t(x, v), v_t(x, v))$
Properties
$\varphi_t$ is a deterministic operator, i.e., no random variables are involved so far
$\varphi_t(x_t(x, v), -v_t(x, v)) = (x, -v)$ (time reversibility)
$H(\varphi_t(x, v)) = H(x, v)$ (energy preservation)
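These two properties can be checked numerically in a case where $\varphi_t$ has a closed form. For the one-dimensional harmonic oscillator $U(x) = x^2/2$ (our own illustrative example, not from the source), the flow is a rotation in phase space:

```python
import numpy as np

# For U(x) = x^2 / 2, H(x, v) = (x^2 + v^2) / 2, and the Hamiltonian flow is a rotation:
# phi_t(x, v) = (x cos t + v sin t, -x sin t + v cos t).
def phi(t, x, v):
    return x * np.cos(t) + v * np.sin(t), -x * np.sin(t) + v * np.cos(t)

H = lambda x, v: 0.5 * x**2 + 0.5 * v**2

x, v, t = 1.3, -0.7, 2.0
xt, vt = phi(t, x, v)
print(np.isclose(H(xt, vt), H(x, v)))          # energy preservation: H(phi_t(x,v)) = H(x,v)
xb, vb = phi(t, xt, -vt)
print(np.isclose(xb, x), np.isclose(vb, -v))   # reversibility: phi_t(x_t, -v_t) = (x, -v)
```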
Suppose $(X, V) \sim \pi(x, v) \propto e^{-H(x, v)}$; then for any $T \ge 0$, $\varphi_T(X, V) \sim \pi(x, v)$.
$(X, V) \sim \pi(x, v)$ implies that $X$ and $V$ are independent, with $X \sim \pi(x) \propto e^{-U(x)}$ and $V \sim N(0, I_d)$
The theorem indicates that the operator $\varphi_T$ does not change the distribution of the random vector $(X, V)$
Marginally, $\varphi_T$ does not change the distribution of $X$ for any $V$ that is independent of $X$
If we define a Markov chain with the following transition algorithm from $X^{(k)}$:
Simulate $V \sim N(0, I_d)$ independent of $X^{(k)}$
Let $(X^{(k+1)}, V') = \varphi_T(X^{(k)}, V)$
Then this transition kernel $T(x^{(k)} \to x^{(k+1)})$ satisfies $\pi(y) = \int \pi(x) T(x \to y)\,dx$, where $\pi(x) \propto e^{-U(x)}$.
For $k = 1, \ldots, K$:
Sample $\xi \sim N(0, I_d)$
Set $(X^{(k+1)}, \xi') = \varphi_T(X^{(k)}, \xi)$
Then under mild conditions on $U(x)$ we obtain the ergodicity of $\{X^{(k)}\}$.
Note that $\varphi_T$ is essentially determined by $\nabla U(x)$, and we assume the differential equations can be solved exactly.
Assume that $U: \mathbb{R}^d \to \mathbb{R}$ is $m$-strongly-convex and $L$-smooth. Let $\nu_k$ be the marginal distribution of $X^{(k)}$ from the idealized HMC algorithm. Then for any $0 < \varepsilon < \sqrt{d}$, with $X^{(0)} = \operatorname{argmin}_x U(x)$, $T = \Omega(1/\sqrt{L})$, and $k = O\left(\frac{L}{m} \log(d/\varepsilon)\right)$, we have $W_2(\nu_k, \pi) \le \frac{\varepsilon}{\sqrt{m}}$, where $W_2(\cdot, \cdot)$ is the 2-Wasserstein distance between two distributions.
The previous theorem means that if $U(\cdot)$ is both smooth and strongly convex, then the distribution of $X^{(k)}$ converges to $\pi$ exponentially fast.
At first glance it is based on the same assumptions as the Langevin algorithm, and the result is also similar.
However, empirical results show that HMC tends to perform better for more complicated distributions, e.g., multimodal distributions.
In reality, of course, we cannot solve the differential equations exactly
Need to discretize the solution
Let $q^{(0)} = X^{(k)}$, $p^{(0)} \sim N(0, I_d)$. For $l = 0, \ldots, L-1$:
$$p^{(l+1/2)} = p^{(l)} - \frac{\tau}{2} \nabla U(q^{(l)}), \qquad q^{(l+1)} = q^{(l)} + \tau p^{(l+1/2)}, \qquad p^{(l+1)} = p^{(l+1/2)} - \frac{\tau}{2} \nabla U(q^{(l+1)})$$
Propose $Y = q^{(L)}$ as the next state
$\tau$ and $L$ are tuning parameters
But now we lose the preservation of the Hamiltonian
Need a Metropolis-Hastings correction!
Acceptance probability
$$\alpha = \min\left(1, \frac{\exp[-H(Y, \tilde{V})]}{\exp[-H(X^{(k)}, V)]}\right), \qquad V = p^{(0)}, \quad \tilde{V} = p^{(L)}$$
With probability $\alpha$ set $X^{(k+1)} = Y$; otherwise set $X^{(k+1)} = X^{(k)}$
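Putting the pieces together, a minimal sketch of HMC with leapfrog discretization and the Metropolis-Hastings correction (the Gaussian example target, interface, and names are our own choices):

```python
import numpy as np

def hmc(U, grad_U, x0, tau, n_leapfrog, n_steps, rng=None):
    """HMC: leapfrog integration of Hamiltonian dynamics plus an MH correction."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        v = rng.standard_normal(x.size)          # fresh momentum V = p^{(0)} ~ N(0, I_d)
        q, p = x.copy(), v.copy()
        for _ in range(n_leapfrog):              # leapfrog: half step, full step, half step
            p = p - 0.5 * tau * grad_U(q)
            q = q + tau * p
            p = p - 0.5 * tau * grad_U(q)
        # H(x, v) = U(x) + ||v||^2 / 2; accept with probability min(1, exp(H_old - H_new))
        H_new = U(q) + 0.5 * p @ p
        H_old = U(x) + 0.5 * v @ v
        if np.log(rng.uniform()) < H_old - H_new:
            x = q
        samples[k] = x
    return samples

# Example: standard Gaussian target, U(x) = ||x||^2 / 2.
samples = hmc(U=lambda x: 0.5 * x @ x, grad_U=lambda x: x,
              x0=np.zeros(2), tau=0.1, n_leapfrog=20, n_steps=5_000)
```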
[1] Gareth O. Roberts and Richard L. Tweedie (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli.
[2] Arnak S. Dalalyan (2017). Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
[3] Xiang Cheng and Peter Bartlett (2018). Convergence of Langevin MCMC in KL-divergence. Algorithmic Learning Theory.
[4] Raaz Dwivedi et al. (2019). Log-concave sampling: Metropolis-Hastings algorithms are fast. Journal of Machine Learning Research.
[5] Nisheeth K. Vishnoi (2021). An introduction to Hamiltonian Monte Carlo method for sampling. arXiv:2108.12107.
[6] Zongchen Chen and Santosh S. Vempala (2022). Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. Theory of Computing.