
Computational Statistics

Lecture 11

Yixuan Qiu

2023-11-29

1 / 47

Simulation and Sampling

2 / 47

Today's Topics

  • Importance sampling

  • Measure transport sampler

3 / 47

Importance Sampling

4 / 47

Importance sampling

Strictly speaking, importance sampling (IS) is not a method to obtain sample points $X_1, \ldots, X_n$ that follow the target distribution $p(x)$.

Instead, it is a technique to estimate expectations related to p(x):

$$\mu = \mathbb{E}_X[f(X)] = \int f(x)\,p(x)\,dx, \qquad X \sim p(x).$$

5 / 47

A Direct Solution

Of course, one direct method to approximate $\mu$ is to generate $X_1, \ldots, X_M \sim p(x)$, and then an unbiased estimator for $\mu$ is given by

$$\hat{\mu} = \frac{1}{M}\sum_{i=1}^M f(X_i), \qquad X_i \sim p(x),\ i = 1, \ldots, M.$$

Suppose that we use rejection sampling to get $X_i$ based on a proposal distribution $q(x)$. There are two issues here:

  • Rejection sampling discards sample points
  • $f(x)$ may be close to zero outside a region $A$ for which $P(X \in A)$ is small
6 / 47

Example

We want to estimate $p = P(X > \pi)$ and $\mu = E(X \mid X > \pi)$, where $X \sim N(0, 1)$.

Naive solution: sample $X_1, \ldots, X_M \overset{\text{iid}}{\sim} N(0, 1)$, and get
$$\hat{p} = \frac{1}{M}\sum_{i=1}^M I\{X_i > \pi\}, \qquad \hat{\mu} = \frac{\sum_{i=1}^M X_i\, I\{X_i > \pi\}}{\sum_{i=1}^M I\{X_i > \pi\}}.$$

Problem: $p$ is very small (true value $\approx 0.00084$), so it may happen that all of the $X_i$'s are smaller than $\pi$.

7 / 47

Example

est_naive = function(n)
{
    # Draw n samples from N(0, 1) and estimate p and mu by simple Monte Carlo
    x = rnorm(n)
    p_hat = mean(x > pi)
    mu_hat = sum(x * (x > pi)) / sum(x > pi)
    c(p_hat, mu_hat)
}
set.seed(123)
est_naive(n = 100)
## [1] 0 NaN
8 / 47

Motivation

IS attempts to resolve the previous issues:

  • It does not discard or waste any sample points

  • Instead, it assigns different weights to each point

  • By properly choosing the proposal distribution, it is able to generate sample points more effectively around the "important region" $A$

9 / 47

Basic Idea

The idea of IS is in fact quite simple. It is based on a straightforward identity:

$$\mu = \int f(x)\,p(x)\,dx = \int \frac{f(x)\,p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_q\!\left(\frac{f(X)\,p(X)}{q(X)}\right),$$

where $q(x)$ is another density function that is positive on $\mathbb{R}^p$, and $\mathbb{E}_q(\cdot)$ denotes the expectation for $X \sim q(x)$.

Accordingly, the IS estimate for μ is

$$\hat{\mu}_q = \frac{1}{M}\sum_{i=1}^M \frac{f(X_i)\,p(X_i)}{q(X_i)}, \qquad X_i \sim q(x),\ i = 1, \ldots, M.$$
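As a quick illustration, the estimator above can be written as a short generic R function; this is a minimal sketch, and the argument names f, dp, rq, and dq are placeholders for the integrand, the target density, the proposal sampler, and the proposal density.

is_estimate = function(M, f, dp, rq, dq)
{
    x = rq(M)                      # X_i ~ q(x)
    mean(f(x) * dp(x) / dq(x))     # (1/M) * sum of f(X_i) p(X_i) / q(X_i)
}
# Example: estimate E[X^2] = 1 for X ~ N(0, 1) using a t(5) proposal
set.seed(123)
is_estimate(10000, f = function(x) x^2, dp = dnorm,
            rq = function(n) rt(n, df = 5), dq = function(x) dt(x, df = 5))
# result should be close to 1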

10 / 47

Theorem [1]

Suppose that $q(x) > 0$ whenever $f(x)\,p(x) \ne 0$. Then $\mathbb{E}_q(\hat{\mu}_q) = \mu$, and $\mathrm{Var}_q(\hat{\mu}_q) = \sigma_q^2 / M$, where

$$\sigma_q^2 = \int_Q \frac{[f(x)\,p(x)]^2}{q(x)}\,dx - \mu^2 = \int_Q \frac{[f(x)\,p(x) - \mu\,q(x)]^2}{q(x)}\,dx,$$

and $Q = \{x : q(x) > 0\}$.

11 / 47

Example

Back to the previous example, note that

$$p = \int_{-\infty}^{+\infty} I(x > \pi)\,\phi(x)\,dx = \int_{\pi}^{+\infty} \frac{I(x > \pi)\,\phi(x)}{q(x)}\,q(x)\,dx = \int_{\pi}^{+\infty} \frac{\phi(x)}{q(x)}\,q(x)\,dx, \qquad q(x) = \begin{cases} 0, & x \le \pi \\ e^{-(x - \pi)}, & x > \pi, \end{cases}$$

where $q(x)$ is a shifted exponential distribution. Similarly,

$$\mu = p^{-1} \int_{\pi}^{+\infty} x\,\frac{\phi(x)}{q(x)}\,q(x)\,dx.$$

12 / 47

Example

est_is = function(n)
{
    # Sample from the shifted exponential proposal q(x) = exp(-(x - pi)), x > pi
    x = rexp(n) + pi
    # Importance ratio phi(x) / q(x) = exp(log phi(x) + x - pi)
    ratio = exp(dnorm(x, log = TRUE) + x - pi)
    p_hat = mean(ratio)
    mu_hat = mean(x * ratio) / p_hat
    c(p_hat, mu_hat)
}
set.seed(123)
est_is(n = 100)
## [1] 0.0007478989 3.4296517837
13 / 47

Optimal q(x) [1]

It can be proved that the optimal proposal distribution is $q^*(x) = |f(x)|\,p(x) / \mathbb{E}_p(|f(X)|)$.

Proof: for any density function $q(x)$ such that $q(x) > 0$ when $f(x)\,p(x) \ne 0$,

$$\mu^2 + \sigma_{q^*}^2 = \int \frac{[f(x)\,p(x)]^2}{q^*(x)}\,dx = \left[\mathbb{E}_p(|f(X)|)\right]^2 = \left[\mathbb{E}_q\!\left(\frac{|f(X)|\,p(X)}{q(X)}\right)\right]^2 \le \mathbb{E}_q\!\left(\frac{[f(X)\,p(X)]^2}{[q(X)]^2}\right) = \int \frac{[f(x)\,p(x)]^2}{q(x)}\,dx = \mu^2 + \sigma_q^2.$$

14 / 47

Optimal q(x)

This means that to approximate $\int f(x)\,p(x)\,dx$, IS can be better than the simple Monte Carlo estimator!
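As a rough numerical check of this claim, we can compare the spread of est_naive() and est_is() from the earlier slides over repeated runs (a small sketch using the functions already defined above):

# Repeat both estimators of p = P(X > pi) many times and compare variability
set.seed(42)
p_naive = replicate(1000, est_naive(100)[1])
p_is    = replicate(1000, est_is(100)[1])
c(sd_naive = sd(p_naive), sd_is = sd(p_is))   # the IS estimator should have a much smaller sd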

15 / 47

Self-Normalized IS

Suppose that we can only compute $p_u(x) \propto p(x)$ and $q_u(x) \propto q(x)$. Then the self-normalized IS estimate is given by

$$\tilde{\mu} = \frac{\sum_{i=1}^M f(X_i)\,w(X_i)}{\sum_{i=1}^M w(X_i)},$$

where $w(x) = p_u(x) / q_u(x)$ and $X_i \sim q(x)$.

Under mild conditions, $\tilde{\mu}$ is a consistent estimator of $\mu$, but in general $\tilde{\mu}$ is no longer unbiased.
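A minimal sketch of the self-normalized estimator in R; the argument names p_u, q_u, and rq are placeholders for the unnormalized target density, the (possibly unnormalized) proposal density, and the proposal sampler.

snis_estimate = function(M, f, p_u, q_u, rq)
{
    x = rq(M)                  # X_i ~ q(x)
    w = p_u(x) / q_u(x)        # unnormalized weights w(X_i)
    sum(f(x) * w) / sum(w)     # weighted average
}
# Example: E[X] for p(x) proportional to exp(-x^2 / 2), using a t(3) proposal
set.seed(123)
snis_estimate(10000, f = identity,
              p_u = function(x) exp(-x^2 / 2),
              q_u = function(x) dt(x, df = 3),
              rq = function(n) rt(n, df = 3))
# result should be close to 0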

16 / 47

Measure Transport Sampler

17 / 47

Recap: Inverse Transform Algorithm

  • $U \sim \mathrm{Unif}(0, 1)$

  • $g = F^{-1}$

  • $X = g(U) \;\Rightarrow\; X \sim f(x)$

  • $f(x) = F'(x)$ is the density function (a quick R check follows this list)
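A quick R check of this recipe, using my own example of the Exp(1) distribution, for which $F^{-1}(u) = -\log(1 - u)$:

set.seed(123)
u = runif(10000)        # U ~ Unif(0, 1)
x = -log(1 - u)         # X = F^{-1}(U) ~ Exp(1)
c(mean(x), var(x))      # both should be close to 1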

18 / 47

Random Variable Transformation

  • Continuous random variable $X \sim p_X(x)$

  • $p_X(x)$ is the density function

  • $g: \mathbb{R} \to \mathbb{R}$ is a monotone function

  • Define $Y = g(X)$; then its density function is given by $p_Y(y) = p_X(g^{-1}(y))\left|\frac{d}{dy} g^{-1}(y)\right|$ (see the numerical check after this list)

  • Can extend to the multivariate case
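A small numerical check of the formula, using my own example $Y = e^X$ with $X \sim N(0, 1)$: here $g^{-1}(y) = \log y$, so the formula gives $p_Y(y) = \phi(\log y)/y$, which should match R's built-in lognormal density.

y = c(0.5, 1, 2, 5)
cbind(formula = dnorm(log(y)) / y, builtin = dlnorm(y))   # the two columns agree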

19 / 47

Random Vector Transformation

  • Continuous random vector $X \in \mathbb{R}^d$, $X \sim p_X(x)$

  • $p_X(x)$ is the density function

  • $T: \mathbb{R}^d \to \mathbb{R}^d$ is a diffeomorphism (a smooth mapping with smooth inverse)

  • Define $Y = T(X)$; then its density function is given by $p_Y(y) = p_X(T^{-1}(y))\left|\det\!\left(\nabla (T^{-1})(y)\right)\right| = p_X(x)\left|\det\!\left(\nabla T(x)\right)\right|^{-1}, \quad x = T^{-1}(y)$

  • $\nabla T$ (resp. $\nabla T^{-1}$) is the Jacobian matrix of $T$ (resp. $T^{-1}$)

20 / 47

Recall: Box-Muller Transform

  1. $(U_1, U_2) \sim \mathrm{Unif}([0, 1]^2)$

  2. Let $Z_1 = \sqrt{-2\log(U_1)}\cos(2\pi U_2)$ and $Z_2 = \sqrt{-2\log(U_1)}\sin(2\pi U_2)$

  3. Then $Z_1$ and $Z_2$ are two independent $N(0, 1)$ random variables (an R implementation sketch follows this list)
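A minimal R sketch of these steps:

box_muller = function(n)
{
    u1 = runif(n); u2 = runif(n)
    r = sqrt(-2 * log(u1))
    c(r * cos(2 * pi * u2), r * sin(2 * pi * u2))   # 2n draws from N(0, 1)
}
set.seed(123)
z = box_muller(5000)
c(mean(z), sd(z))   # should be close to 0 and 1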

21 / 47

Recall: Box-Muller Transform

  • Transformation mapping: $T\begin{pmatrix} U_1 \\ U_2 \end{pmatrix} = \begin{pmatrix} \sqrt{-2\log(U_1)}\cos(2\pi U_2) \\ \sqrt{-2\log(U_1)}\sin(2\pi U_2) \end{pmatrix}$

  • Jacobian matrix: $\nabla T\begin{pmatrix} U_1 \\ U_2 \end{pmatrix} = \begin{pmatrix} -\dfrac{\cos(\theta)}{U_1 L} & -2\pi L\sin(\theta) \\[6pt] -\dfrac{\sin(\theta)}{U_1 L} & 2\pi L\cos(\theta) \end{pmatrix}, \qquad L = \sqrt{-2\log(U_1)},\ \theta = 2\pi U_2$

  • Determinant: $\det \nabla T = -2\pi\,U_1^{-1}$

22 / 47

Recall: Box-Muller Transform

  • Clearly, $p_U(u_1, u_2) = 1$, so $p_Z(z_1, z_2) = p_U(u_1, u_2)\,\left|\det \nabla T\right|^{-1} = (2\pi)^{-1} u_1$

  • To express $u_1$ using $z_1$ and $z_2$, note that $Z_1^2 + Z_2^2 = -2\log(U_1)$, so $u_1 = e^{-\frac{1}{2}(z_1^2 + z_2^2)}$

  • Therefore, $p_Z(z_1, z_2) = \frac{1}{2\pi} e^{-\frac{1}{2}(z_1^2 + z_2^2)}$, which implies that $Z_1$ and $Z_2$ follow independent $N(0, 1)$ distributions

23 / 47

Ideas from Inverse Transform

  • Fix a source distribution $p_Z(z)$, e.g. $N(0, I)$

  • Given a target distribution $p(x)$

  • Can we learn a mapping $T$ such that $X = T(Z) \sim p(x)$ if $Z \sim p_Z(z)$?

24 / 47

Measure Transport Sampler

  • If we can obtain such a mapping, sampling would be easy:

    • Simulate $Z_1, \ldots, Z_n \sim N(0, I)$

    • Set $X_1 = T(Z_1), \ldots, X_n = T(Z_n)$

    • Then $X_i \overset{\text{iid}}{\sim} p(x)$

  • The key is to find such a mapping $T$, also called a transport map

25 / 47

Estimating Transport Map

  • Fix a source distribution $Z \sim p_0(z)$, e.g. $N(0, I)$

  • Assume we have a model $T_\theta \in \mathcal{T}$ to parameterize $T$

  • $\mathcal{T}$ is chosen such that $T_\theta$ is a diffeomorphism (we will cover this later)

  • Then $X = T_\theta(Z) \sim p_\theta(x)$ has a known density function

  • The problem becomes how to estimate $\theta$

26 / 47

Criterion: KL Divergence

  • Given a target distribution $p(x)$

  • We want to make $p_\theta(x)$ as close to $p(x)$ as possible

  • A natural criterion is the Kullback-Leibler (KL) divergence $D_{\mathrm{KL}}(p_\theta \,\|\, p) = \int p_\theta(x) \log \frac{p_\theta(x)}{p(x)}\,dx = \mathbb{E}_{p_\theta}\!\left(\log \frac{p_\theta}{p}\right)$ (a small numerical illustration follows this list)

  • The best $\theta$ is given by $\theta^* = \operatorname*{arg\,min}_\theta\, D_{\mathrm{KL}}(p_\theta \,\|\, p)$
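A small numerical illustration of the KL criterion, using my own example with both distributions normal so the divergence has a closed form: a Monte Carlo estimate of $D_{\mathrm{KL}}(p_\theta \,\|\, p)$ with $p_\theta = N(1, 2^2)$ and $p = N(0, 1)$.

set.seed(123)
x = rnorm(1e5, mean = 1, sd = 2)          # X ~ p_theta
kl_mc = mean(dnorm(x, 1, 2, log = TRUE) - dnorm(x, 0, 1, log = TRUE))
kl_exact = log(1 / 2) + (2^2 + 1^2) / 2 - 1 / 2   # closed form for two normals
c(kl_mc, kl_exact)                        # the two values should agree closely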

27 / 47

Training Algorithm

  • An unbiased estimator for $D_{\mathrm{KL}}(p_\theta \,\|\, p)$ is given by $\hat{L}(\theta) = \frac{1}{M}\sum_{i=1}^M \left[\log p_\theta(X_i) - \log p(X_i)\right]$, where $X_i = T_\theta(Z_i)$, $Z_i \sim N(0, I)$

  • We further have $p_\theta(x) = p_0(z)\,\left|\det\!\left(\nabla T_\theta(z)\right)\right|^{-1}$, $z = T_\theta^{-1}(x)$

28 / 47

Training Algorithm

  • As a result, $\hat{L}(\theta) = \frac{1}{M}\sum_{i=1}^M \left[\log p_0(Z_i) - \log\left|\det\!\left(\nabla T_\theta(Z_i)\right)\right| - \log p(X_i)\right] = -\frac{1}{M}\sum_{i=1}^M \left[\log\left|\det\!\left(\nabla T_\theta(Z_i)\right)\right| + \log p(T_\theta(Z_i))\right] + C$, where $C$ does not depend on $\theta$

  • Therefore, $\mathbb{E}_Z\,\hat{L}(\theta) = D_{\mathrm{KL}}(p_\theta \,\|\, p)$ and $\mathbb{E}_Z\!\left(\nabla_\theta \hat{L}(\theta)\right) = \nabla_\theta D_{\mathrm{KL}}(p_\theta \,\|\, p)$

29 / 47

Stochastic Gradient Descent

We can then use stochastic gradient descent to optimize $D_{\mathrm{KL}}(p_\theta \,\|\, p)$; a toy numerical sketch follows the steps below.

In each iteration:

  • Simulate $Z_i \overset{\text{iid}}{\sim} N(0, I)$, $i = 1, \ldots, M$

  • Define $\hat{\ell}(\theta) = -\frac{1}{M}\sum_{i=1}^M \left[\log\left|\det\!\left(\nabla T_\theta(Z_i)\right)\right| + \log p(T_\theta(Z_i))\right]$

  • Compute $g(\theta) = \nabla_\theta \hat{\ell}(\theta)$

  • Update $\theta \leftarrow \theta - \alpha\, g(\theta)$
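A toy one-dimensional sketch of this loop (my own example, not the lecture's implementation): fit an affine map $T_\theta(z) = a + e^b z$, for which $\log|\det(\nabla T_\theta)| = b$, to the target $N(2, 1.5^2)$, using the analytic gradient of $\hat{\ell}(\theta)$ for this simple family.

set.seed(123)
mu_t = 2; sd_t = 1.5               # target N(mu_t, sd_t^2), density assumed known
a = 0; b = 0                       # theta = (a, b), T_theta(z) = a + exp(b) * z
alpha = 0.05; M = 100
for (iter in 1:2000)
{
    z = rnorm(M)
    x = a + exp(b) * z             # X_i = T_theta(Z_i)
    resid = (x - mu_t) / sd_t^2    # equals -d log p(x) / dx for the normal target
    grad_a = mean(resid)                      # d l_hat / d a
    grad_b = mean(-1 + resid * exp(b) * z)    # d l_hat / d b
    a = a - alpha * grad_a
    b = b - alpha * grad_b
}
c(a = a, scale = exp(b))           # should approach 2 and 1.5

In practice $T_\theta$ is a neural network and the gradient is obtained by automatic differentiation, as noted later.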

30 / 47

Unnormalized Target Density

  • In fact, the training algorithm is also valid for unnormalized target densities.

  • Assume $p(x) \propto e^{-E(x)}$

  • Then just change $\hat{\ell}(\theta)$ to $\hat{\ell}(\theta) = -\frac{1}{M}\sum_{i=1}^M \left[\log\left|\det\!\left(\nabla T_\theta(Z_i)\right)\right| - E(T_\theta(Z_i))\right]$

  • This is because the unknown constant does not affect the gradient

31 / 47

Demonstration

32 / 47

Modeling Transport Maps

33 / 47

Modeling Transport Maps

For practical use, the transport map $T_\theta$ needs to have some "nice" properties:

  • $T_\theta$ should be invertible and differentiable

  • $T_\theta^{-1}$ and $\det(\nabla T_\theta)$ should be easy to compute

  • $T_\theta$ should be flexible enough to characterize sophisticated nonlinear mappings

At first glance, these are quite strong conditions.

34 / 47

Modeling Transport Maps

  • Polynomials [2] (not good enough)

  • Normalizing flows [3] (tools from the deep learning community)

35 / 47

Normalizing Flow

  • Normalizing flows (NFs) are a class of diffeomorphisms constructed by neural networks

  • NFs define invertible mappings through composition: $T = T_K \circ \cdots \circ T_1$

  • Each $T_i$ is a simpler diffeomorphism

36 / 47

Change-of-Variable Revisited

  • Recall the density formula for $X = T(Z)$: $p_X(x) = p_Z(z)\,\left|\det\!\left(\nabla T(z)\right)\right|^{-1}$, $z = T^{-1}(x)$
  • For $T = T_K \circ \cdots \circ T_1$, let $z_0 = z$, $z_K = x$, and $z_i = T_i(z_{i-1})$; then $\log\left|\det\!\left(\nabla T(z)\right)\right| = \sum_{i=1}^K \log\left|\det\!\left(\nabla T_i(z_{i-1})\right)\right|$ (a small check follows this list)
  • So we still obtain an explicit form for $p_X(x)$
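A small scalar check of the composition rule (my own example): compose $T_1(z) = e^z$ and $T_2(y) = y^3$, so $T(z) = e^{3z}$; the per-layer log absolute derivatives add up to that of the composition.

z  = 0.7
y1 = exp(z);  ld1 = log(abs(exp(z)))      # layer 1: derivative of exp at z
y2 = y1^3;    ld2 = log(abs(3 * y1^2))    # layer 2: derivative of y^3 at y1
c(sum_of_layers = ld1 + ld2, direct = log(abs(3 * exp(3 * z))))   # the two are equal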
37 / 47

Implementations

Many different implementations of NFs [4,5]:

  • Permutation and orthogonal

  • Decomposition-based

  • Planar and radial

  • Coupling

  • Autoregressive

  • ...

38 / 47

Example: Real NVP

  • One popular implementation of NFs is the affine coupling flow, also called Real NVP [6]

  • Given $X \in \mathbb{R}^d$, define $Y = T(X) \in \mathbb{R}^d$ as below (a toy R sketch follows this list): $$Y_{1:r} = X_{1:r}, \qquad Y_{(r+1):d} = \mu(X_{1:r}) + \sigma(X_{1:r}) \odot X_{(r+1):d}$$

  • $X_{1:r}$ is the first $r$ elements of $X$, $r < d$

  • $\odot$ is elementwise multiplication

  • $\mu, \sigma: \mathbb{R}^r \to \mathbb{R}^{d-r}$ are two neural networks, with $\sigma(\cdot) > 0$
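A toy R sketch of one affine coupling layer; mu_fun and sigma_fun below are hand-written stand-ins for the neural networks $\mu$ and $\sigma$. It shows the forward map, its exact inverse, and the log-determinant of the Jacobian.

mu_fun    = function(x1) c(sin(x1[1]), x1[1] * x1[2])        # R^2 -> R^2, arbitrary choice
sigma_fun = function(x1) exp(c(0.5 * x1[1], -0.3 * x1[2]))   # positive outputs

coupling_forward = function(x, r = 2)
{
    x1 = x[1:r]; x2 = x[-(1:r)]
    y2 = mu_fun(x1) + sigma_fun(x1) * x2
    list(y = c(x1, y2), logdet = sum(log(sigma_fun(x1))))    # log|det| = sum of log sigma
}
coupling_inverse = function(y, r = 2)
{
    y1 = y[1:r]; y2 = y[-(1:r)]
    c(y1, (y2 - mu_fun(y1)) / sigma_fun(y1))
}
x = c(0.3, -1.2, 0.8, 2.0)
out = coupling_forward(x)
all.equal(coupling_inverse(out$y), x)    # TRUE: the layer is exactly invertible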

39 / 47

Why Real NVP Works

Consider the composition of two transformations:

$$\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix} \xrightarrow{\;T_1\;} \begin{pmatrix} Y_1 = X_1 \\ Y_2 = X_2 \\ Y_3 = \mu_{Y_3}(X_1, X_2) + \sigma_{Y_3}(X_1, X_2)\,X_3 \\ Y_4 = \mu_{Y_4}(X_1, X_2) + \sigma_{Y_4}(X_1, X_2)\,X_4 \end{pmatrix}, \qquad \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix} \xrightarrow{\;T_2\;} \begin{pmatrix} Z_1 = \mu_{Z_1}(Y_3, Y_4) + \sigma_{Z_1}(Y_3, Y_4)\,Y_1 \\ Z_2 = \mu_{Z_2}(Y_3, Y_4) + \sigma_{Z_2}(Y_3, Y_4)\,Y_2 \\ Z_3 = Y_3 \\ Z_4 = Y_4 \end{pmatrix}$$

40 / 47

Why Real NVP Works

$T_1$ can be easily inverted (similarly for $T_2$):

$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix} \xrightarrow{\;T_1^{-1}\;} \begin{pmatrix} X_1 = Y_1 \\ X_2 = Y_2 \\ X_3 = \left[Y_3 - \mu_{Y_3}(Y_1, Y_2)\right] / \sigma_{Y_3}(Y_1, Y_2) \\ X_4 = \left[Y_4 - \mu_{Y_4}(Y_1, Y_2)\right] / \sigma_{Y_4}(Y_1, Y_2) \end{pmatrix}$$

41 / 47

Why Real NVP Works

Recall that $T_1$ maps $(X_1, X_2, X_3, X_4)$ to $(Y_1, Y_2, Y_3, Y_4)$ as above. Its Jacobian $\nabla T_1$ is lower triangular:

$$\nabla T_1(x) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ * & * & \sigma_{Y_3}(x_1, x_2) & 0 \\ * & * & 0 & \sigma_{Y_4}(x_1, x_2) \end{pmatrix},$$

where $*$ denotes entries that do not affect the determinant. Hence $\det(\nabla T_1)$ can be computed in linear time (it is the product of the diagonal elements).

42 / 47

Why Real NVP Works

Some theoretical works such as [7] show that Real NVP flows are universal approximators.

As a result, all three conditions are satisfied.

43 / 47

Parameterization

  • Construct the transport map Tθ using normalizing flows

  • The vector θ contains all the neural network parameters

  • $\nabla_\theta \hat{\ell}(\theta)$ is typically computed using automatic differentiation

44 / 47

Extensions

Recent literature such as [8] shows that the measure transport sampler based on the KL loss function does not learn multimodal distributions well.

Improved samplers have been proposed, and more remains to be explored.

45 / 47

References

[1] https://artowen.su.domains/mc/Ch-var-is.pdf

[2] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini (2016). An introduction to sampling via measure transport. arXiv:1602.05023.

[3] Matthew Hoffman et al. (2019). NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704.

[4] George Papamakarios et al. (2021). Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research.

[5] Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker (2020). Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.

46 / 47

References

[6] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.

[7] Takeshi Teshima et al. (2020). Coupling-based invertible neural networks are universal diffeomorphism approximators. Advances in Neural Information Processing Systems 33.

[8] Yixuan Qiu and Xiao Wang (2023). Efficient multimodal sampling via tempered distribution flow. Journal of the American Statistical Association.

47 / 47
