Computational Statistics
Lecture 11
Yixuan Qiu
2023-11-29
1 / 47

Simulation and Sampling2 / 47

Today's Topics

Importance sampling
Measure transport sampler

3 / 47

Importance Sampling4 / 47

Importance sampling

Strictly speaking, importance sampling (IS) is not a method to obtain sample points $X_{1}, \dots, X_{n}$ that follow the target distribution $p (x)$ .

Instead, it is a technique to estimate expectations related to $p (x)$ :

$μ = E_{X} [f (X)] = \int f (x) p (x) d x, X \sim p (x) .$

5 / 47

A Direct Solution

Of course, one direct method to approximate $μ$ is to generate $X_{1}, \dots, X_{M} \sim p (x)$ , and then an unbiased estimator for $μ$ is given by

$\hat{μ} = \frac{1}{M} \sum_{i = 1}^{M} f (X_{i}), X_{i} \sim p (x), i = 1, \dots, M .$

Suppose that we use rejection sampling to get $X_{i}$ based on a proposal distribution $q (x)$ . There are two issues here:

Rejection sampling discards sample points
$f (x)$ may be close to zero outside a region $A$ for which $P (X \in A)$ is small

6 / 47

Example

We want to estimate $p = P (X > π)$ and $μ = E (X | X > π)$ , where $X \sim N (0, 1)$ .

Naive solution: sample $X_{1}, \dots, X_{M} \overset{i i d}{\sim} N (0, 1)$ , and get $\begin{aligned} \hat{p} & = \frac{1}{M} \sum_{i = 1}^{M} I {X_{i} > π}, \\ \hat{μ} & = \frac{\sum_{i = 1}^{M} X_{i} \cdot I {X_{i} > π}}{\sum_{i = 1}^{M} I {X_{i} > π}} . \end{aligned}$

Problem: $p$ is very small (true value ~0.00084), so maybe all $X_{i}$ 's are smaller than $π$ .

7 / 47

Example

est_naive = function(n)
{
    x = rnorm(n)
    p_hat = mean(x > pi)
    mu_hat = sum(x * (x > pi)) / sum(x > pi)
    c(p_hat, mu_hat)
}
set.seed(123)
est_naive(n = 100)

## [1]   0 NaN

8 / 47

Motivation

IS attempts to resolve the previous issues:

It does not discard or waste any sample points
Instead, it assigns different weights to each point
By properly choosing the proposal distribution, it is able to more effectively generate sample points around the "important region" $A$

9 / 47

Basic Idea

The idea of IS is in fact quite simple. It is based on a straightforward identity:

$μ = \int f (x) p (x) d x = \int \frac{f (x) p (x)}{q (x)} q (x) d x = E_{q} (\frac{f (X) p (X)}{q (X)}),$

where $q (x)$ is another density function that is positive on $R^{p}$ , and $E_{q} (\cdot)$ denotes the expectation for $X \sim q (x)$ .

Accordingly, the IS estimate for $μ$ is

${\hat{μ}}_{q} = \frac{1}{M} \sum_{i = 1}^{M} \frac{f (X_{i}) p (X_{i})}{q (X_{i})}, X_{i} \sim q (x), i = 1, \dots, M .$

10 / 47

Theorem ^[1]

Suppose that $q (x) > 0$ whenever $f (x) p (x) \neq 0$ . Then $E_{q} ({\hat{μ}}_{q}) = μ$ , and ${V a r}_{q} ({\hat{μ}}_{q}) = σ_{q}^{2} / n$ , where

$σ_{q}^{2} = \int_{Q} \frac{[f (x) p (x)]^{2}}{q (x)} d x - μ^{2} = \int_{Q} \frac{[f (x) p (x) - μ q (x)]^{2}}{q (x)} d x,$

and $Q = {x : q (x) > 0}$ .

11 / 47

Example

Back to the previous example, note that

$\begin{aligned} p & = \int_{- \infty}^{+ \infty} I (x > π) ϕ (x) d x \\ = \int_{π}^{+ \infty} \frac{I (x > π) ϕ (x)}{q (x)} q (x) d x = \int_{π}^{+ \infty} \frac{ϕ (x)}{q (x)} q (x) d x, \\ q (x) & = {\begin{cases} 0, & x \leq π \\ e^{- (x - π)}, & x > π \end{cases}, \end{aligned}$

where $q (x)$ is a shifted exponential distribution. Similarly,

$μ = p^{- 1} \int_{π}^{+ \infty} \frac{x \cdot ϕ (x)}{q (x)} q (x) d x .$

12 / 47

Example

est_is = function(n)
{
    x = rexp(n) + pi
    ratio = exp(dnorm(x, log = TRUE) + x - pi)
    p_hat = mean(ratio)
    mu_hat = mean(x * ratio) / p_hat
    c(p_hat, mu_hat)
}
set.seed(123)
est_is(n = 100)

## [1] 0.0007478989 3.4296517837

13 / 47

Optimal $q (x)$ ^[1]

It can be proved that the optimal proposal distribution $q^{*} (x)$ is given by $q^{*} (x) = | f (x) | p (x) / E_{p} (| f (X) |)$ .

Proof: for any density function $q (x)$ such that $q (x) > 0$ when $f (x) p (x) \neq 0$ ,

$\begin{aligned} μ^{2} + σ_{q^{*}}^{2} & = \int \frac{[f (x) p (x)]^{2}}{q^{*} (x)} d x = [E_{p} (| f (X) |)]^{2} \\ = {[E_{q} (\frac{| f (X) | p (X)}{q (X)})]}^{2} \\ \leq E_{q} (\frac{[f (X) p (X)]^{2}}{[q (X)]^{2}}) = \int \frac{[f (x) p (x)]^{2}}{q (x)} d x = μ^{2} + σ_{q}^{2} . \end{aligned}$

14 / 47

Optimal $q (x)$

This means that to approximate $\int f (x) p (x) d x$ , IS can be better than the simple Monte Carlo estimator!

15 / 47

Self-Normalized IS

Suppose that we can only compute $p_{u} (x) \propto p (x)$ and $q_{u} (x) \propto q (x)$ , the self-normalized IS estimate is given by

$\tilde{μ} = \frac{\sum_{i = 1}^{M} f (X_{i}) w (X_{i})}{\sum_{i = 1}^{M} w (X_{i})},$

where $w (x) = p_{u} (x) / q_{u} (x)$ and $X_{i} \sim q (x)$ .

Under mild conditions, $\tilde{μ}$ is a consistent estimator of $μ$ , but in general $\tilde{μ}$ is no longer unbiased.

16 / 47

Measure Transport Sampler17 / 47

Recap: Inverse Transform Algorithm

$U \sim U n i f (0, 1)$
$g = F^{- 1}$
$X = g (U) \Rightarrow X \sim f (x)$
$f (x) = F^{'} (x)$ is the density function

18 / 47

Random Variable Transformation

Continuous random variable $X \sim p_{X} (x)$
$p_{X} (x)$ density function
$g : R \to R$ a monotone function
Define $Y = g (X)$ , then its density function is given by $p_{Y} (y) = p_{X} (g^{- 1} (y)) | \frac{d}{d y} g^{- 1} (y) |$
Can extend to multivariate case

19 / 47

Random Vector Transformation

Continuous random vector $X \in R^{d}$ , $X \sim p_{X} (x)$
$p_{X} (x)$ density function
$T : R^{d} \to R^{d}$ a diffeomorphism (a smooth mapping with smooth inverse)
Define $Y = T (X)$ , then its density function is given by $\begin{aligned} p_{Y} (y) & = p_{X} (T^{- 1} (y)) | det (\nabla (T^{- 1}) (y)) | \\ = p_{X} (x) {| det (\nabla T (x)) |}^{- 1}, x = T^{- 1} (y) \end{aligned}$
$\nabla T$ (cf. $\nabla T^{- 1}$ ) is the Jacobian matrix of $T$ (cf. $T^{- 1}$ )

20 / 47

Recall: Box-Muller Transform

$(U_{1}, U_{2}) \sim U n i f ([0, 1]^{2})$
Let $Z_{1} = \sqrt{- 2 \log (U_{1})} \cos (2 π U_{2})$ and $Z_{2} = \sqrt{- 2 \log (U_{1})} \sin (2 π U_{2})$
Then $Z_{1}$ and $Z_{2}$ are two independent $N (0, 1)$ random variables

21 / 47

Recall: Box-Muller Transform

Transformation mapping $T (\begin{matrix} U_{1} \\ U_{2} \end{matrix}) = (\begin{matrix} \sqrt{- 2 \log (U_{1})} \cos (2 π U_{2}) \\ \sqrt{- 2 \log (U_{1})} \sin (2 π U_{2}) \end{matrix})$
Jacobian matrix $\begin{aligned} \nabla T (\begin{matrix} U_{1} \\ U_{2} \end{matrix}) & = (\begin{array}{cc} \frac{- \cos (θ)}{U_{1} L} & - 2 π L \sin (θ) \\ \frac{- \sin (θ)}{U_{1} L} & 2 π L \cos (θ) \end{array}) \\ L & = \sqrt{- 2 \log (U_{1})}, θ = 2 π U_{2} \end{aligned}$
Determinant $det \nabla T = - 2 π U_{1}^{- 1}$

22 / 47

Recall: Box-Muller Transform

Clearly, $p_{U} (u_{1}, u_{2}) = 1$ , so $p_{Z} (z_{1}, z_{2}) = p_{U} (u_{1}, u_{2}) | det \nabla T |^{- 1} = (2 π)^{- 1} u_{1}$
To express $u_{1}$ using $z_{1}$ and $z_{2}$ , note that $Z_{1}^{2} + Z_{2}^{2} = - 2 \log (U_{1})$
Therefore, $p_{Z} (z_{1}, z_{2}) = \frac{1}{2 π} e^{- \frac{1}{2} (z_{1}^{2} + z_{2}^{2})},$ which implies that $Z_{1}$ and $Z_{2}$ follow independent $N (0, 1)$

23 / 47

Ideas from Inverse Transform

Fix a source distribution $p_{Z} (z)$ , e.g. $N (0, I)$
Given a target distribution $p (x)$
Can we learn a mapping $T$ such that $X = T (Z) \sim p (x)$ if $Z \sim p_{Z} (z)$ ?

24 / 47

Measure Transport Sampler

If we can obtain such a mapping, sampling would be easy:
- Simulate $Z_{1}, \dots, Z_{n} \sim N (0, I)$
- Set $X_{1} = T (Z_{1}), \dots, X_{n} = T (Z_{n})$
- Then $X_{i} \overset{i i d}{\sim} p (x)$
The key is to find such a mapping $T$ , also called a transport map

25 / 47

Estimating Transport Map

Fix a source distribution $Z \sim p_{0} (z)$ , e.g. $N (0, I)$
Assume we have a model $T_{θ} \in T$ to parameterize $T$
$T$ is chosen such that $T_{θ}$ is a diffeomorphism (we will cover this later)
Then $X = T_{θ} (Z) \sim p_{θ} (x)$ has a known density function
The problem becomes how to estimate $θ$

26 / 47

Criterion: KL Divergence

Given target distribution $p (x)$
We want to make $p_{θ} (x)$ close to $p (x)$ as much as possible
A natural criterion is the Kullback-Leibler (KL) divergence $D_{K L} (p_{θ} ‖ p) = \int p_{θ} (x) \log \frac{p_{θ} (x)}{p (x)} d x = E_{p_{θ}} (\log \frac{p_{θ}}{p})$
The best $θ$ is given by $θ^{*} = \underset{θ}{\arg min} D_{K L} (p_{θ} ‖ p)$

27 / 47

Training Algorithm

An unbiased estimator for $D_{K L} (p_{θ} ‖ p)$ is given by $\hat{L} (θ) = \frac{1}{M} \sum_{i = 1}^{M} [\log p_{θ} (X_{i}) - \log p (X_{i})],$ where $X_{i} = T_{θ} (Z_{i}), Z_{i} \sim N (0, I)$
We further have $p_{θ} (x) = p_{0} (z) {| det (\nabla T_{θ} (z)) |}^{- 1}, z = T_{θ}^{- 1} (x)$

28 / 47

Training Algorithm

As a result, $\begin{aligned} \hat{L} (θ) & = \frac{1}{M} \sum_{i = 1}^{M} [\log p_{0} (Z_{i}) - \log | det (\nabla T_{θ} (Z_{i})) | - \log p (X_{i})] \\ = - \frac{1}{M} \sum_{i = 1}^{M} [\log | det (\nabla T_{θ} (Z_{i})) | + \log p (T_{θ} (Z_{i}))] + C \end{aligned}$
Therefore, $\begin{aligned} E_{Z} \hat{L} (θ) & = D_{K L} (p_{θ} ‖ p) \\ E_{Z} (\nabla_{θ} \hat{L} (θ)) & = \nabla_{θ} D_{K L} (p_{θ} ‖ p) \end{aligned}$

29 / 47

Stochastic Gradient Descent

We can then use stochastic gradient descent to optimize $D_{K L} (p_{θ} ‖ p)$ .

In each iteration:

Simulate $Z_{i} \overset{i i d}{\sim} N (0, I)$
Define $\hat{ℓ} (θ) = - \frac{1}{M} \sum_{i = 1}^{M} [\log | det (\nabla T_{θ} (Z_{i})) | + \log p (T_{θ} (Z_{i}))]$
Compute $g (θ) = \nabla_{θ} \hat{ℓ} (θ)$
Update $θ \leftarrow θ - α \cdot g (θ)$

30 / 47

Unnormalized Target Density

In fact, the training algorithm is also valid for unnormalized target densities.
Assume $p (x) \propto e^{- E (x)}$
Then just change $\hat{ℓ} (θ)$ to $\hat{ℓ} (θ) = - \frac{1}{M} \sum_{i = 1}^{M} [\log | det (\nabla T_{θ} (Z_{i})) | - E (T_{θ} (Z_{i}))]$
This is because the unknown constant does not affect the gradient

31 / 47

Demonstration
  
32 / 47

Modeling Transport Maps33 / 47

Modeling Transport Maps

For practical use, the transport map $T_{θ}$ needs to have some "nice" properties:

$T_{θ}$ should be invertible and differentiable
$T_{θ}^{- 1}$ and $det (\nabla T_{θ})$ should be easy to compute
$T_{θ}$ should be flexible enough to characterize sophisticated nonlinear mappings

At first glance, these are quite strong conditions.

34 / 47

Modeling Transport Maps

Polynomials ^[2] (not good enough)
Normalizing flows ^[3] (tools from the deep learning community)

35 / 47

Normalizing Flow

Normalizing flows (NFs) are a class of diffeomorphisms constructed by neural networks
NFs define invertible mappings through composition: $T = T_{K} \circ \dots \circ T_{1}$
Each $T_{i}$ is a simpler diffeomorphism

36 / 47

Change-of-Variable RevisitedRecall the density formula for X=T(Z)X=T(Z):
pX(x)=pZ(z)|det(∇T(z))|−1,z=T−1(x)pX(x)=pZ(z)|det(∇T(z))|−1,z=T−1(x)
For T=TK∘⋯∘T1T=TK∘⋯∘T1, let z0=zz0=z, zK=xzK=x, and zi=Ti(zi−1)zi=Ti(zi−1), then
log|det(∇T(z))|=K∑i=1log|det(∇Ti(zi−1))|log⁡|det(∇T(z))|=∑i=1Klog⁡|det(∇Ti(zi−1))|
So we still obtain an explicit form for pX(x)pX(x)
37 / 47

Implementations

Many different implementations of NFs ^[4,5]:

Permutation and orthogonal
Decomposition-based
Planar and radial
Coupling
Autoregressive
...

38 / 47

Example: Real NVP

One popular implementation of NF is the affine coupling flow, also called Real NVP ^[6]
Given $X \in R^{d}$ , define $Y = T (X) \in R^{d}$ as below: $\begin{aligned} Y_{1 : r} & = X_{1 : r} \\ Y_{(r + 1) : d} & = μ (X_{1 : r}) + σ (X_{1 : r}) ⊙ X_{(r + 1) : d} \end{aligned}$
$X_{1 : r}$ is the first $r$ elements of $X$ , $r < d$
$⊙$ is elementwise multiplication
$μ, σ : R^{r} \to R^{d - r}$ are two neural networks, $σ (\cdot) > 0$

39 / 47

Why Real NVP Works

Consider the composition of two transformations: $\begin{aligned} (\begin{matrix} X_{1} \\ X_{2} \\ X_{3} \\ X_{4} \end{matrix}) & \overset{T_{1}}{\Rightarrow} (\begin{matrix} Y_{1} = X_{1} \\ Y_{2} = X_{2} \\ Y_{3} = μ_{Y_{3}} (X_{1}, X_{2}) + σ_{Y_{3}} (X_{1}, X_{2}) \cdot X_{3} \\ Y_{4} = μ_{Y_{4}} (X_{1}, X_{2}) + σ_{Y_{4}} (X_{1}, X_{2}) \cdot X_{4} \end{matrix}) \\ (\begin{matrix} Y_{1} \\ Y_{2} \\ Y_{3} \\ Y_{4} \end{matrix}) & \overset{T_{2}}{\Rightarrow} (\begin{matrix} Z_{1} = μ_{Z_{1}} (Y_{3}, Y_{4}) + σ_{Z_{1}} (Y_{3}, Y_{4}) \cdot Y_{1} \\ Z_{2} = μ_{Z_{2}} (Y_{3}, Y_{4}) + σ_{Z_{2}} (Y_{3}, Y_{4}) \cdot Y_{2} \\ Z_{3} = Y_{3} \\ Z_{4} = Y_{4} \end{matrix}) \end{aligned}$

40 / 47

Why Real NVP Works

$T_{1}$ can be easily inverted (similar for $T_{2}$ ): $\begin{aligned} (\begin{matrix} X_{1} \\ X_{2} \\ X_{3} \\ X_{4} \end{matrix}) \overset{T_{1}}{\Rightarrow} (\begin{matrix} Y_{1} = X_{1} \\ Y_{2} = X_{2} \\ Y_{3} = μ_{Y_{3}} (X_{1}, X_{2}) + σ_{Y_{3}} (X_{1}, X_{2}) \cdot X_{3} \\ Y_{4} = μ_{Y_{4}} (X_{1}, X_{2}) + σ_{Y_{4}} (X_{1}, X_{2}) \cdot X_{4} \end{matrix}) \\ (\begin{matrix} X_{1} = Y_{1} \\ X_{2} = Y_{2} \\ X_{3} = [Y_{3} - μ_{Y_{3}} (Y_{1}, Y_{2})] / σ_{Y_{3}} (Y_{1}, Y_{2}) \\ X_{4} = [Y_{4} - μ_{Y_{4}} (Y_{1}, Y_{2})] / σ_{Y_{4}} (Y_{1}, Y_{2}) \end{matrix}) \overset{T_{1}^{- 1}}{\Leftarrow} (\begin{matrix} Y_{1} \\ Y_{2} \\ Y_{3} \\ Y_{4} \end{matrix}) \end{aligned}$

41 / 47

Why Real NVP Works

$\nabla T_{1}$ is lower triangular: $\begin{aligned} (\begin{matrix} X_{1} \\ X_{2} \\ X_{3} \\ X_{4} \end{matrix}) & \overset{T_{1}}{\Rightarrow} (\begin{matrix} Y_{1} = X_{1} \\ Y_{2} = X_{2} \\ Y_{3} = μ_{Y_{3}} (X_{1}, X_{2}) + σ_{Y_{3}} (X_{1}, X_{2}) \cdot X_{3} \\ Y_{4} = μ_{Y_{4}} (X_{1}, X_{2}) + σ_{Y_{4}} (X_{1}, X_{2}) \cdot X_{4} \end{matrix}) \\ \nabla T_{1} (x) & = (\begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ * & * & σ_{Y_{3}} (x_{1}, x_{2}) & 0 \\ * & * & 0 & σ_{Y_{4}} (x_{1}, x_{2}) \end{array}) \end{aligned}$

$det (\nabla T_{1})$ can be computed in linear time (the product of diagonal elements).

42 / 47

Why Real NVP Works

Some theoretical works such as [7] show that Real NVP flows are universal approximators.

As a result, all three conditions are satisfied.

43 / 47

Parameterization

Construct the transport map $T_{θ}$ using normalizing flows
The vector $θ$ contains all the neural network parameters
$\nabla_{θ} \hat{ℓ} (θ)$ is typically computed using automatic differentiation

44 / 47

Extensions

Recent literature such as [8] shows that the measure transport sampler based on the KL loss function does not learn multimodal distributions well.

Some improved samplers exist, and more is yet to be explored.

45 / 47

References

[1] https://artowen.su.domains/mc/Ch-var-is.pdf

[2] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini (2016). An introduction to sampling via measure transport. arXiv:1602.05023.

[3] Matthew Hoffman et al. (2019). Neutra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704.

[4] George Papamakarios et al. (2021). Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research.

[5] Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker (2020). Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.

46 / 47

References

[6] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.

[7] Takeshi Teshima et al. (2020). Coupling-based invertible neural networks are universal diffeomorphism approximators. Advances in Neural Information Processing Systems 33.

[8] Yixuan Qiu and Xiao Wang (2023). Efficient multimodal sampling via tempered distribution flow. Journal of the American Statistical Association.

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Computational Statistics

Lecture 11

Yixuan Qiu

2023-11-29

Simulation and Sampling

Today's Topics

Importance Sampling

Importance sampling

A Direct Solution

Example

Example

Motivation

Basic Idea

Theorem [1]

Example

Example

Optimal q(x)q(x) [1]

Optimal q(x)q(x)

Self-Normalized IS

Measure Transport Sampler

Recap: Inverse Transform Algorithm

Random Variable Transformation

Random Vector Transformation

Recall: Box-Muller Transform

Recall: Box-Muller Transform

Recall: Box-Muller Transform

Ideas from Inverse Transform

Measure Transport Sampler

Estimating Transport Map

Criterion: KL Divergence

Training Algorithm

Training Algorithm

Stochastic Gradient Descent

Unnormalized Target Density

Demonstration

Modeling Transport Maps

Modeling Transport Maps

Modeling Transport Maps

Normalizing Flow

Change-of-Variable Revisited

Implementations

Example: Real NVP

Why Real NVP Works

Why Real NVP Works

Why Real NVP Works

Why Real NVP Works

Parameterization

Extensions

References

References

Simulation and Sampling

Help

Theorem ^[1]

Optimal $q (x)$ ^[1]

Optimal $q (x)$