Importance sampling
Strictly speaking, importance sampling (IS) is not a method to obtain sample points $X_1, \ldots, X_n$ that follow the target distribution $p(x)$.
Instead, it is a technique to estimate expectations related to $p(x)$:
$$\mu = E_X[f(X)] = \int f(x) p(x)\,dx, \quad X \sim p(x).$$
Of course, one direct method to approximate $\mu$ is to generate $X_1, \ldots, X_M \sim p(x)$, and then an unbiased estimator for $\mu$ is given by
$$\hat\mu = \frac{1}{M}\sum_{i=1}^M f(X_i), \quad X_i \sim p(x),\ i = 1, \ldots, M.$$
Suppose that we use rejection sampling to get $X_i$ based on a proposal distribution $q(x)$. There are two issues here: the rejected sample points are simply thrown away, and since the accepted points follow $p(x)$, few of them may fall in the region that matters most for $f$.
We want to estimate $p = P(X > \pi)$ and $\mu = E(X \mid X > \pi)$, where $X \sim N(0, 1)$.
Naive solution: sample $X_1, \ldots, X_M \overset{iid}{\sim} N(0, 1)$, and get
$$\hat p = \frac{1}{M}\sum_{i=1}^M I\{X_i > \pi\}, \qquad \hat\mu = \frac{\sum_{i=1}^M X_i \cdot I\{X_i > \pi\}}{\sum_{i=1}^M I\{X_i > \pi\}}.$$
Problem: $p$ is very small (true value $\approx 0.00084$), so it may well happen that all the $X_i$'s are smaller than $\pi$.
```r
est_naive = function(n) {
  x = rnorm(n)
  p_hat = mean(x > pi)
  mu_hat = sum(x * (x > pi)) / sum(x > pi)
  c(p_hat, mu_hat)
}

set.seed(123)
est_naive(n = 100)
## [1] 0 NaN
```
IS attempts to resolve the previous issues:
It does not discard or waste any sample points
Instead, it assigns different weights to each point
By properly choosing the proposal distribution, it is able to more effectively generate sample points around the "important region"
The idea of IS is in fact quite simple. It is based on a straightforward identity:
$$\mu = \int f(x) p(x)\,dx = \int \frac{f(x) p(x)}{q(x)}\, q(x)\,dx = E_q\left(\frac{f(X) p(X)}{q(X)}\right),$$
where $q(x)$ is another density function that is positive on $\mathbb{R}^p$, and $E_q(\cdot)$ denotes the expectation for $X \sim q(x)$.
Accordingly, the IS estimate for $\mu$ is
$$\hat\mu_q = \frac{1}{M}\sum_{i=1}^M \frac{f(X_i) p(X_i)}{q(X_i)}, \quad X_i \sim q(x),\ i = 1, \ldots, M.$$
Suppose that $q(x) > 0$ whenever $f(x) p(x) \neq 0$. Then $E_q(\hat\mu_q) = \mu$, and $\mathrm{Var}_q(\hat\mu_q) = \sigma_q^2 / M$, where
$$\sigma_q^2 = \int_Q \frac{[f(x) p(x)]^2}{q(x)}\,dx - \mu^2 = \int_Q \frac{[f(x) p(x) - \mu q(x)]^2}{q(x)}\,dx,$$
and $Q = \{x : q(x) > 0\}$.
Back to the previous example, note that
$$p = \int_{-\infty}^{+\infty} I(x > \pi)\,\phi(x)\,dx = \int_{\pi}^{+\infty} I(x > \pi)\,\frac{\phi(x)}{q(x)}\,q(x)\,dx = \int_{\pi}^{+\infty} \frac{\phi(x)}{q(x)}\,q(x)\,dx, \qquad q(x) = \begin{cases} 0, & x \le \pi \\ e^{-(x - \pi)}, & x > \pi, \end{cases}$$
where $q(x)$ is a shifted exponential distribution. Similarly,
$$\mu = p^{-1} \int_{\pi}^{+\infty} x \cdot \frac{\phi(x)}{q(x)}\,q(x)\,dx.$$
```r
est_is = function(n) {
  x = rexp(n) + pi
  ratio = exp(dnorm(x, log = TRUE) + x - pi)
  p_hat = mean(ratio)
  mu_hat = mean(x * ratio) / p_hat
  c(p_hat, mu_hat)
}

set.seed(123)
est_is(n = 100)
## [1] 0.0007478989 3.4296517837
```
It can be proved that the optimal proposal distribution $q^*(x)$ is given by $q^*(x) = |f(x)| p(x) / E_p(|f(X)|)$.
Proof: for any density function $q(x)$ such that $q(x) > 0$ when $f(x) p(x) \neq 0$,
$$\mu^2 + \sigma^2_{q^*} = \int \frac{[f(x) p(x)]^2}{q^*(x)}\,dx = [E_p(|f(X)|)]^2 = \left[E_q\left(\frac{|f(X)| p(X)}{q(X)}\right)\right]^2 \le E_q\left(\frac{[f(X) p(X)]^2}{[q(X)]^2}\right) = \int \frac{[f(x) p(x)]^2}{q(x)}\,dx = \mu^2 + \sigma^2_q,$$
where the inequality follows from Jensen's inequality.
This means that to approximate $\int f(x) p(x)\,dx$, IS can be better than the simple Monte Carlo estimator!
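As a rough empirical illustration of this claim (not part of the original example), we can compare the spread of $\hat p$ from `est_naive()` and `est_is()` over repeated runs:

```r
# Compare the Monte Carlo variability of the naive and IS estimators of p
# (both functions are defined above)
set.seed(123)
p_naive = replicate(1000, est_naive(n = 100)[1])
p_is    = replicate(1000, est_is(n = 100)[1])
c(sd_naive = sd(p_naive), sd_is = sd(p_is))  # IS should show a much smaller sd
```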
Suppose that we can only compute $p_u(x) \propto p(x)$ and $q_u(x) \propto q(x)$. Then the self-normalized IS estimate is given by
$$\tilde\mu = \frac{\sum_{i=1}^M f(X_i) w(X_i)}{\sum_{i=1}^M w(X_i)},$$
where $w(x) = p_u(x) / q_u(x)$ and $X_i \sim q(x)$.
Under mild conditions, $\tilde\mu$ is a consistent estimator of $\mu$, but in general $\tilde\mu$ is no longer unbiased.
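As a small sketch of self-normalized IS in R, assume (for this illustration only) an unnormalized target $p_u(x) = e^{-x^4/4}$, an unnormalized proposal $q_u(x) = e^{-x^2/2}$ (i.e. $q$ is $N(0,1)$), and $f(x) = x^2$:

```r
# Self-normalized IS sketch: target known only up to a constant
# p_u(x) = exp(-x^4 / 4), proposal q = N(0, 1) with q_u(x) = exp(-x^2 / 2)
sn_is = function(M) {
  x = rnorm(M)                      # X_i ~ q(x)
  logw = (-x^4 / 4) - (-x^2 / 2)    # log w(x) = log p_u(x) - log q_u(x)
  w = exp(logw - max(logw))         # rescaling does not change the ratio
  sum(x^2 * w) / sum(w)             # self-normalized estimate of E_p[X^2]
}

set.seed(123)
sn_is(M = 10000)
```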
Measure transport sampler
$U \sim \mathrm{Unif}(0, 1)$
$g = F^{-1}$
$X = g(U) \Rightarrow X \sim f(x)$
$f(x) = F'(x)$ is the density function (a small R sketch follows)
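The bullets above describe inverse transform sampling. A minimal R sketch, taking the Exp(1) distribution as an assumed example target ($F(x) = 1 - e^{-x}$, so $g(u) = -\log(1 - u)$):

```r
# Inverse transform sampling for Exp(1): F(x) = 1 - exp(-x), g(u) = -log(1 - u)
set.seed(123)
u = runif(10000)
x = -log(1 - u)       # X = g(U) should follow Exp(1)
c(mean(x), var(x))    # both should be close to 1
```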
Continuous random variable $X \sim p_X(x)$
$p_X(x)$ density function
$g: \mathbb{R} \to \mathbb{R}$ a monotone function
Define $Y = g(X)$, then its density function is given by $p_Y(y) = p_X(g^{-1}(y)) \left|\frac{d}{dy} g^{-1}(y)\right|$ (see the numerical check below)
Can extend to the multivariate case
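A quick numerical check of this change-of-variables formula, using the assumed example $X \sim N(0,1)$ and $g(x) = e^x$ (so $Y$ is log-normal):

```r
# Check p_Y(y) = p_X(g^{-1}(y)) |d/dy g^{-1}(y)| for X ~ N(0,1), Y = exp(X):
# here g^{-1}(y) = log(y) and |d/dy g^{-1}(y)| = 1/y
y = c(0.5, 1, 2, 5)
p_formula = dnorm(log(y)) / y
p_lognorm = dlnorm(y)          # known log-normal density
cbind(p_formula, p_lognorm)    # the two columns should agree
```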
Continuous random vector $X \in \mathbb{R}^d$, $X \sim p_X(x)$
$p_X(x)$ density function
$T: \mathbb{R}^d \to \mathbb{R}^d$ a diffeomorphism (a smooth mapping with a smooth inverse)
Define $Y = T(X)$, then its density function is given by $p_Y(y) = p_X(T^{-1}(y)) \left|\det(\nabla (T^{-1})(y))\right| = p_X(x)\, |\det(\nabla T(x))|^{-1}$, $x = T^{-1}(y)$
$\nabla T$ (resp. $\nabla T^{-1}$) is the Jacobian matrix of $T$ (resp. $T^{-1}$)
$(U_1, U_2) \sim \mathrm{Unif}([0, 1]^2)$
Let $Z_1 = \sqrt{-2\log(U_1)}\cos(2\pi U_2)$ and $Z_2 = \sqrt{-2\log(U_1)}\sin(2\pi U_2)$
Then $Z_1$ and $Z_2$ are two independent $N(0, 1)$ random variables
Transformation mapping
$$T\begin{pmatrix} U_1 \\ U_2 \end{pmatrix} = \begin{pmatrix} \sqrt{-2\log(U_1)}\cos(2\pi U_2) \\ \sqrt{-2\log(U_1)}\sin(2\pi U_2) \end{pmatrix}$$
Jacobian matrix
$$\nabla T\begin{pmatrix} U_1 \\ U_2 \end{pmatrix} = \begin{pmatrix} -\dfrac{\cos(\theta)}{U_1 L} & -2\pi L \sin(\theta) \\[2mm] -\dfrac{\sin(\theta)}{U_1 L} & 2\pi L \cos(\theta) \end{pmatrix}, \qquad L = \sqrt{-2\log(U_1)}, \quad \theta = 2\pi U_2$$
Determinant $\det \nabla T = -2\pi U_1^{-1}$
Clearly, $p_U(u_1, u_2) = 1$, so $p_Z(z_1, z_2) = p_U(u_1, u_2)\, |\det \nabla T|^{-1} = (2\pi)^{-1} u_1$
To express $u_1$ using $z_1$ and $z_2$, note that $Z_1^2 + Z_2^2 = -2\log(U_1)$, so $u_1 = e^{-(z_1^2 + z_2^2)/2}$
Therefore, $p_Z(z_1, z_2) = \frac{1}{2\pi} e^{-\frac{1}{2}(z_1^2 + z_2^2)}$, which implies that $Z_1$ and $Z_2$ follow independent $N(0, 1)$
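A direct R implementation of the Box-Muller transform described above (the summary statistics at the end are only a rough sanity check):

```r
# Box-Muller: turn 2n independent uniforms into 2n independent N(0,1) draws
box_muller = function(n) {
  u1 = runif(n)
  u2 = runif(n)
  z1 = sqrt(-2 * log(u1)) * cos(2 * pi * u2)
  z2 = sqrt(-2 * log(u1)) * sin(2 * pi * u2)
  c(z1, z2)
}

set.seed(123)
z = box_muller(5000)
c(mean(z), var(z))   # should be close to 0 and 1
```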
Fix a source distribution $p_Z(z)$, e.g. $N(0, I)$
Given a target distribution $p(x)$
Can we learn a mapping $T$ such that $X = T(Z) \sim p(x)$ if $Z \sim p_Z(z)$?
If we can obtain such a mapping, sampling would be easy:
Simulate $Z_1, \ldots, Z_n \sim N(0, I)$
Set $X_1 = T(Z_1), \ldots, X_n = T(Z_n)$
Then $X_i \overset{iid}{\sim} p(x)$
The key is to find such a mapping $T$, also called a transport map
Fix a source distribution $Z \sim p_0(z)$, e.g. $N(0, I)$
Assume we have a model $T_\theta \in \mathcal{T}$ to parameterize $T$
$\mathcal{T}$ is chosen such that $T_\theta$ is a diffeomorphism (we will cover this later)
Then $X = T_\theta(Z) \sim p_\theta(x)$ has a known density function
The problem becomes how to estimate $\theta$
Given a target distribution $p(x)$
We want to make $p_\theta(x)$ as close to $p(x)$ as possible
A natural criterion is the Kullback-Leibler (KL) divergence
$$D_{\mathrm{KL}}(p_\theta \| p) = \int p_\theta(x) \log\frac{p_\theta(x)}{p(x)}\,dx = E_{p_\theta}\left(\log\frac{p_\theta}{p}\right)$$
The best $\theta$ is given by $\theta^* = \operatorname{arg\,min}_\theta\, D_{\mathrm{KL}}(p_\theta \| p)$
An unbiased estimator for $D_{\mathrm{KL}}(p_\theta \| p)$ is given by
$$\hat L(\theta) = \frac{1}{M}\sum_{i=1}^M \left[\log p_\theta(X_i) - \log p(X_i)\right], \quad X_i = T_\theta(Z_i),\ Z_i \sim N(0, I)$$
We further have
$$p_\theta(x) = p_0(z)\, |\det(\nabla T_\theta(z))|^{-1}, \quad z = T_\theta^{-1}(x)$$
As a result,
$$\hat L(\theta) = \frac{1}{M}\sum_{i=1}^M \left[\log p_0(Z_i) - \log|\det(\nabla T_\theta(Z_i))| - \log p(X_i)\right] = -\frac{1}{M}\sum_{i=1}^M \left[\log|\det(\nabla T_\theta(Z_i))| + \log p(T_\theta(Z_i))\right] + C,$$
where $C$ does not depend on $\theta$
Therefore,
$$E_Z[\hat L(\theta)] = D_{\mathrm{KL}}(p_\theta \| p), \qquad E_Z[\nabla_\theta \hat L(\theta)] = \nabla_\theta D_{\mathrm{KL}}(p_\theta \| p)$$
We can then use stochastic gradient descent to optimize $D_{\mathrm{KL}}(p_\theta \| p)$.
In each iteration:
Simulate $Z_i \overset{iid}{\sim} N(0, I)$
Define $\hat\ell(\theta) = -\frac{1}{M}\sum_{i=1}^M \left[\log|\det(\nabla T_\theta(Z_i))| + \log p(T_\theta(Z_i))\right]$
Compute $g(\theta) = \nabla_\theta \hat\ell(\theta)$
Update $\theta \leftarrow \theta - \alpha \cdot g(\theta)$ (a toy R sketch of this loop is given below)
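Below is a toy sketch of this training loop (not the implementation used in practice), with a one-dimensional affine map $T_\theta(z) = m + e^{s} z$, $\theta = (m, s)$, and an assumed target $p(x) = N(3, 2^2)$. For such a simple map the gradient of $\hat\ell(\theta)$ can be written out by hand; real normalizing flows would rely on automatic differentiation instead.

```r
# Toy measure transport training: source Z ~ N(0,1), map T(z) = m + exp(s) * z,
# assumed target p(x) = N(3, 2^2). Here ell_hat(theta) and its gradient are
# computed analytically: log|det grad T| = s, d/dx log p(x) = -(x - 3) / 4.
mu0 = 3; sigma0 = 2       # target parameters (illustrative choice)
m = 0; s = 0              # initialize theta = (m, s)
alpha = 0.05              # step size
M = 100                   # Monte Carlo batch size

set.seed(123)
for (iter in 1:2000) {
  z = rnorm(M)                             # Z_i ~ N(0, 1)
  x = m + exp(s) * z                       # X_i = T_theta(Z_i)
  dlogp = -(x - mu0) / sigma0^2            # d/dx log p(x) at X_i
  grad_m = -mean(dlogp)                    # d ell_hat / dm
  grad_s = -mean(1 + dlogp * exp(s) * z)   # d ell_hat / ds
  m = m - alpha * grad_m                   # theta <- theta - alpha * g(theta)
  s = s - alpha * grad_s
}
c(m, exp(s))   # should be close to (3, 2), i.e. T_theta(Z) approx N(3, 2^2)
```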
In fact, the training algorithm is also valid for unnormalized target densities.
Assume $p(x) \propto e^{-E(x)}$
Then just change $\hat\ell(\theta)$ to $\hat\ell(\theta) = -\frac{1}{M}\sum_{i=1}^M \left[\log|\det(\nabla T_\theta(Z_i))| - E(T_\theta(Z_i))\right]$
This is because the unknown constant does not affect the gradient
For practical use, the transport map $T_\theta$ needs to have some "nice" properties:
$T_\theta$ should be invertible and differentiable
$T_\theta^{-1}$ and $\det(\nabla T_\theta)$ should be easy to compute
$T_\theta$ should be flexible enough to characterize sophisticated nonlinear mappings
At first glance, these are quite strong conditions. Candidate constructions of $T_\theta$ include:
Polynomials [2] (not good enough)
Normalizing flows [3] (tools from the deep learning community)
Normalizing flows (NFs) are a class of diffeomorphisms constructed by neural networks
NFs define invertible mappings through composition: $T = T_K \circ \cdots \circ T_1$
Each $T_i$ is a simpler diffeomorphism
Many different implementations of NFs [4,5]:
Permutation and orthogonal
Decomposition-based
Planar and radial
Coupling
Autoregressive
...
One popular implementation of NF is the affine coupling flow, also called Real NVP [6]
Given $X \in \mathbb{R}^d$, define $Y = T(X) \in \mathbb{R}^d$ as below:
$$Y_{1:r} = X_{1:r}, \qquad Y_{(r+1):d} = \mu(X_{1:r}) + \sigma(X_{1:r}) \odot X_{(r+1):d}$$
$X_{1:r}$ is the first $r$ elements of $X$, $r < d$
$\odot$ is elementwise multiplication
$\mu, \sigma: \mathbb{R}^r \to \mathbb{R}^{d-r}$ are two neural networks, $\sigma(\cdot) > 0$
Consider the composition of two transformations:
$$\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix} \overset{T_1}{\Longrightarrow} \begin{pmatrix} Y_1 = X_1 \\ Y_2 = X_2 \\ Y_3 = \mu_{Y_3}(X_1, X_2) + \sigma_{Y_3}(X_1, X_2)\cdot X_3 \\ Y_4 = \mu_{Y_4}(X_1, X_2) + \sigma_{Y_4}(X_1, X_2)\cdot X_4 \end{pmatrix}, \qquad \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix} \overset{T_2}{\Longrightarrow} \begin{pmatrix} Z_1 = \mu_{Z_1}(Y_3, Y_4) + \sigma_{Z_1}(Y_3, Y_4)\cdot Y_1 \\ Z_2 = \mu_{Z_2}(Y_3, Y_4) + \sigma_{Z_2}(Y_3, Y_4)\cdot Y_2 \\ Z_3 = Y_3 \\ Z_4 = Y_4 \end{pmatrix}$$
$T_1$ can be easily inverted (similar for $T_2$):
$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix} \overset{T_1^{-1}}{\Longrightarrow} \begin{pmatrix} X_1 = Y_1 \\ X_2 = Y_2 \\ X_3 = [Y_3 - \mu_{Y_3}(Y_1, Y_2)] / \sigma_{Y_3}(Y_1, Y_2) \\ X_4 = [Y_4 - \mu_{Y_4}(Y_1, Y_2)] / \sigma_{Y_4}(Y_1, Y_2) \end{pmatrix}$$
$\nabla T_1$ is lower triangular:
$$\nabla T_1(x) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ * & * & \sigma_{Y_3}(x_1, x_2) & 0 \\ * & * & 0 & \sigma_{Y_4}(x_1, x_2) \end{pmatrix}$$
$\det(\nabla T_1)$ can be computed in linear time (the product of the diagonal elements).
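To make the forward map, its inverse, and the log-determinant concrete, here is a minimal R sketch of a single affine coupling layer with $d = 4$ and $r = 2$. The "neural networks" $\mu(\cdot)$ and $\sigma(\cdot)$ are replaced by arbitrary fixed smooth functions purely for illustration:

```r
# Toy affine coupling layer (d = 4, r = 2); mu_fn() and sigma_fn() stand in
# for the neural networks and are chosen arbitrarily here
mu_fn    = function(x12) c(sin(x12[1]), x12[1] * x12[2])
sigma_fn = function(x12) exp(c(0.5 * x12[1], -0.3 * x12[2]))  # strictly positive

coupling_forward = function(x) {
  y34 = mu_fn(x[1:2]) + sigma_fn(x[1:2]) * x[3:4]
  list(y = c(x[1:2], y34),
       logdet = sum(log(sigma_fn(x[1:2]))))  # log|det(grad T_1)| in linear time
}

coupling_inverse = function(y) {
  x34 = (y[3:4] - mu_fn(y[1:2])) / sigma_fn(y[1:2])
  c(y[1:2], x34)
}

set.seed(123)
x = rnorm(4)
fwd = coupling_forward(x)
max(abs(coupling_inverse(fwd$y) - x))   # should be numerically zero
```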
Some theoretical works such as [7] show that Real NVP flows are universal approximators.
As a result, all three conditions are satisfied.
Construct the transport map $T_\theta$ using normalizing flows
The vector $\theta$ contains all the neural network parameters
$\nabla_\theta \hat\ell(\theta)$ is typically computed using automatic differentiation
Recent literature such as [8] shows that the measure transport sampler based on the KL loss function does not learn multimodal distributions well.
Some improved samplers exist, and more is yet to be explored.
[1] https://artowen.su.domains/mc/Ch-var-is.pdf
[2] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini (2016). An introduction to sampling via measure transport. arXiv:1602.05023.
[3] Matthew Hoffman et al. (2019). NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704.
[4] George Papamakarios et al. (2021). Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research.
[5] Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker (2020). Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[6] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
[7] Takeshi Teshima et al. (2020). Coupling-based invertible neural networks are universal diffeomorphism approximators. Advances in Neural Information Processing Systems 33.
[8] Yixuan Qiu and Xiao Wang (2023). Efficient multimodal sampling via tempered distribution flow. Journal of the American Statistical Association.