class: center, middle, inverse, title-slide .title[ # Computational Statistics ] .subtitle[ ## Lecture 11 ] .author[ ### Yixuan Qiu ] .date[ ### 2023-11-29 ] --- class: inverse, center, middle # Simulation and Sampling --- # Today's Topics - Importance sampling - Measure transport sampler --- class: inverse, center, middle # Importance Sampling --- # Importance sampling Strictly speaking, importance sampling (IS) is not a method to obtain sample points `\(X_1,\ldots,X_n\)` that follow the target distribution `\(p(x)\)`. Instead, it is a technique to estimate expectations related to `\(p(x)\)`: `$$\mu=\mathbb{E}_X[f(X)]=\int f(x)p(x)\mathrm{d}x,\quad X\sim p(x).$$` --- # A Direct Solution Of course, one direct method to approximate `\(\mu\)` is to generate `\(X_1,\ldots,X_M\sim p(x)\)`, and then an unbiased estimator for `\(\mu\)` is given by `$$\hat{\mu}=\frac{1}{M}\sum_{i=1}^M f(X_i),\quad X_i\sim p(x),\ i=1,\ldots,M.$$` Suppose that we use rejection sampling to get `\(X_i\)` based on a proposal distribution `\(q(x)\)`. There are two issues here: - Rejection sampling discards sample points - `\(f(x)\)` may be close to zero outside a region `\(A\)` for which `\(P(X\in A)\)` is small --- # Example We want to estimate `\(p=P(X>\pi)\)` and `\(\mu=\mathbb{E}(X|X>\pi)\)`, where `\(X\sim N(0,1)\)`. Naive solution: sample `\(X_1,\ldots,X_M\overset{iid}{\sim}N(0,1)\)`, and get `$$\begin{align*} \hat{p} & =\frac{1}{M}\sum_{i=1}^{M}I\{X_{i}>\pi\},\\ \hat{\mu} & =\frac{\sum_{i=1}^{M}X_{i}\cdot I\{X_{i}>\pi\}}{\sum_{i=1}^{M}I\{X_{i}>\pi\}}. \end{align*}$$` Problem: `\(p\)` is very small (true value ~0.00084), so maybe all `\(X_i\)`'s are smaller than `\(\pi\)`. --- # Example ```r est_naive = function(n) { x = rnorm(n) p_hat = mean(x > pi) mu_hat = sum(x * (x > pi)) / sum(x > pi) c(p_hat, mu_hat) } set.seed(123) est_naive(n = 100) ``` ``` ## [1] 0 NaN ``` --- # Motivation IS attempts to resolve the previous issues: - It does not discard or waste any sample points - Instead, it assigns different weights to each point - By properly choosing the proposal distribution, it is able to more effectively generate sample points around the "important region" `\(A\)` --- # Basic Idea The idea of IS is in fact quite simple. It is based on a straightforward identity: `$$\mu=\int f(x)p(x)\mathrm{d}x=\int\frac{f(x)p(x)}{q(x)}q(x)\mathrm{d}x=\mathbb{E}_{q}\left(\frac{f(X)p(X)}{q(X)}\right),$$` where `\(q(x)\)` is another density function that is positive on `\(\mathbb{R}^p\)`, and `\(\mathbb{E}_q(\cdot)\)` denotes the expectation for `\(X\sim q(x)\)`. Accordingly, the IS estimate for `\(\mu\)` is `$$\require{color}\hat{\mu}_q=\frac{1}{M}\sum_{i=1}^M \frac{f(X_i)p(X_i)}{q(X_i)},\quad X_i\sim \textcolor{deeppink}{q(x)},\ i=1,\ldots,M.$$` --- # Theorem <sup><span class="small">[1]</span></sup> Suppose that `\(q(x)>0\)` whenever `\(f(x)p(x)\neq 0\)`. Then `\(\mathbb{E}_q(\hat{\mu}_q)=\mu\)`, and `\(\mathrm{Var}_q(\hat{\mu}_q)=\sigma_q^2/n\)`, where `$$\sigma_{q}^{2}=\int_{\mathcal{Q}}\frac{[f(x)p(x)]^{2}}{q(x)}\mathrm{d}x-\mu^{2}=\int_{\mathcal{Q}}\frac{[f(x)p(x)-\mu q(x)]^{2}}{q(x)}\mathrm{d}x,$$` and `\(\mathcal{Q}=\{x:q(x)>0\}\)`. --- # Example Back to the previous example, note that `$$\begin{align*} p & =\int_{-\infty}^{+\infty}I(x>\pi)\phi(x)\mathrm{d}x\\ & =\int_{\pi}^{+\infty}\frac{I(x>\pi)\phi(x)}{q(x)}q(x)\mathrm{d}x=\int_{\pi}^{+\infty}\frac{\phi(x)}{q(x)}q(x)\mathrm{d}x,\\ q(x) & =\begin{cases} 0, & x\le\pi\\ e^{-(x-\pi)}, & x>\pi \end{cases}, \end{align*}$$` where `\(q(x)\)` is a shifted exponential distribution. Similarly, `$$\mu=p^{-1}\int_{\pi}^{+\infty}\frac{x\cdot\phi(x)}{q(x)}q(x)\mathrm{d}x.$$` --- # Example ```r est_is = function(n) { x = rexp(n) + pi ratio = exp(dnorm(x, log = TRUE) + x - pi) p_hat = mean(ratio) mu_hat = mean(x * ratio) / p_hat c(p_hat, mu_hat) } set.seed(123) est_is(n = 100) ``` ``` ## [1] 0.0007478989 3.4296517837 ``` --- # Optimal `\(q(x)\)` <sup><span class="small">[1]</span></sup> It can be proved that the optimal proposal distribution `\(q^*(x)\)` is given by `\(q^*(x)=|f(x)|p(x)/\mathbb{E}_p (|f(X)|)\)`. Proof: for any density function `\(q(x)\)` such that `\(q(x)>0\)` when `\(f(x)p(x)\neq 0\)`, `$$\begin{align*} \mu^{2}+\sigma_{q^{*}}^{2} & =\int\frac{[f(x)p(x)]^{2}}{q^{*}(x)}\mathrm{d}x=[\mathbb{E}_{p}(|f(X)|)]^{2}\\ & =\left[\mathbb{E}_{q}\left(\frac{|f(X)|p(X)}{q(X)}\right)\right]^{2}\\ & \le\mathbb{E}_{q}\left(\frac{[f(X)p(X)]^{2}}{[q(X)]^{2}}\right)=\int\frac{[f(x)p(x)]^{2}}{q(x)}\mathrm{d}x=\mu^{2}+\sigma_{q}^{2}. \end{align*}$$` --- # Optimal `\(q(x)\)` This means that to approximate `\(\int f(x)p(x)\mathrm{d}x\)`, IS can be better than the simple Monte Carlo estimator! --- # Self-Normalized IS Suppose that we can only compute `\(p_u(x)\propto p(x)\)` and `\(q_u(x)\propto q(x)\)`, the .highlight[self-normalized IS estimate] is given by `$$\tilde{\mu}=\frac{\sum_{i=1}^{M}f(X_{i})w(X_{i})}{\sum_{i=1}^{M}w(X_{i})},$$` where `\(w(x)=p_u(x)/q_u(x)\)` and `\(X_i\sim q(x)\)`. Under mild conditions, `\(\tilde{\mu}\)` is a consistent estimator of `\(\mu\)`, but in general `\(\tilde{\mu}\)` .highlight[is no longer unbiased]. --- class: inverse, center, middle # Measure Transport Sampler --- # Recap: Inverse Transform Algorithm - `\(U\sim Unif(0,1)\)` - `\(g=F^{-1}\)` - `\(X=g(U)\Rightarrow X\sim f(x)\)` - `\(f(x)=F'(x)\)` is the density function --- # Random Variable Transformation - Continuous random variable `\(X\sim p_X(x)\)` - `\(p_X(x)\)` density function - `\(g:\mathbb{R}\rightarrow\mathbb{R}\)` a .highlight[monotone] function - Define `\(Y=g(X)\)`, then its density function is given by `$$p_{Y}(y)=p_{X}(g^{-1}(y))\left|\frac{\mathrm{d}}{\mathrm{d}y}g^{-1}(y)\right|$$` - Can extend to multivariate case --- # Random Vector Transformation - Continuous random vector `\(X\in\mathbb{R}^d\)`, `\(X\sim p_X(x)\)` - `\(p_X(x)\)` density function - `\(T:\mathbb{R}^d\rightarrow\mathbb{R}^d\)` a .highlight[diffeomorphism] (a smooth mapping with smooth inverse) - Define `\(Y=T(X)\)`, then its density function is given by `$$\begin{align*} p_{Y}(y) & =p_{X}(T^{-1}(y))\left|\det\left(\nabla(T^{-1})(y)\right)\right|\\ & =p_{X}(x)\left|\det\left(\nabla T(x)\right)\right|^{-1},\quad x=T^{-1}(y) \end{align*}$$` - `\(\nabla T\)` (cf. `\(\nabla T^{-1}\)`) is the Jacobian matrix of `\(T\)` (cf. `\(T^{-1}\)`) --- # Recall: Box-Muller Transform 1. `\((U_1,U_2)\sim \mathrm{Unif}([0,1]^2)\)` 2. Let `\(Z_1=\sqrt{-2\log(U_1)}\cos(2\pi U_2)\)` and `\(Z_2=\sqrt{-2\log(U_1)}\sin(2\pi U_2)\)` 3. Then `\(Z_1\)` and `\(Z_2\)` are two .highlight[independent] `\(N(0,1)\)` random variables --- # Recall: Box-Muller Transform - Transformation mapping `$$\small T\left(\begin{array}{c} U_{1}\\ U_{2} \end{array}\right)=\left(\begin{array}{c} \sqrt{-2\log(U_{1})}\cos(2\pi U_{2})\\ \sqrt{-2\log(U_{1})}\sin(2\pi U_{2}) \end{array}\right)$$` - Jacobian matrix `$$\small \begin{align*} \nabla T\left(\begin{array}{c} U_{1}\\ U_{2} \end{array}\right) & =\left(\begin{array}{cc} \frac{-\cos(\theta)}{U_{1}L} & -2\pi L\sin(\theta)\\ \frac{-\sin(\theta)}{U_{1}L} & 2\pi L\cos(\theta) \end{array}\right)\\ L & =\sqrt{-2\log(U_{1})},\quad\theta=2\pi U_{2} \end{align*}$$` - Determinant `$$\small \det\nabla T=-2\pi U_{1}^{-1}$$` --- # Recall: Box-Muller Transform - Clearly, `\(p_U(u_1,u_2)=1\)`, so `$$p_{Z}(z_{1},z_{2})=p_U(u_1,u_2)|\det\nabla T|^{-1}=(2\pi)^{-1}u_1$$` - To express `\(u_1\)` using `\(z_1\)` and `\(z_2\)`, note that `$$Z_1^2+Z_2^2=-2\log(U_{1})$$` - Therefore, `$$p_{Z}(z_{1},z_{2})=\frac{1}{2\pi}e^{-\frac{1}{2}(z_{1}^{2}+z_{2}^{2})},$$` which implies that `\(Z_1\)` and `\(Z_2\)` follow independent `\(N(0,1)\)` --- # Ideas from Inverse Transform - Fix a source distribution `\(p_Z(z)\)`, e.g. `\(N(0,I)\)` - Given a target distribution `\(p(x)\)` - Can we .highlight[learn a mapping] `\(T\)` such that `\(X=T(Z)\sim p(x)\)` if `\(Z\sim p_Z(z)\)`? --- # Measure Transport Sampler - If we can obtain such a mapping, sampling would be easy: - Simulate `\(Z_1,\ldots,Z_n\sim N(0,I)\)` - Set `\(X_1=T(Z_1),\ldots,X_n=T(Z_n)\)` - Then `\(X_i\overset{iid}{\sim}p(x)\)` - The key is to find such a mapping `\(T\)`, also called a .highlight[transport map] --- # Estimating Transport Map - Fix a source distribution `\(Z\sim p_0(z)\)`, e.g. `\(N(0,I)\)` - Assume we have a model `\(T_{\theta}\in\mathcal{T}\)` to parameterize `\(T\)` - `\(\mathcal{T}\)` is chosen such that `\(T_{\theta}\)` is a .highlight[diffeomorphism] (we will cover this later) - Then `\(X=T_{\theta}(Z)\sim p_{\theta}(x)\)` has a known density function - The problem becomes how to estimate `\(\theta\)` --- # Criterion: KL Divergence - Given target distribution `\(p(x)\)` - We want to make `\(p_\theta(x)\)` close to `\(p(x)\)` as much as possible - A natural criterion is the Kullback-Leibler (KL) divergence `$$\mathcal{D}_{\mathrm{KL}}(p_{\theta}\Vert p)=\int p_\theta(x)\log\frac{p_\theta(x)}{p(x)}\mathrm{d}x=\mathbb{E}_{p_\theta}\left(\log\frac{p_\theta}{p}\right)$$` - The best `\(\theta\)` is given by `$$\theta^*=\underset{\theta}{\arg\min}\ \mathcal{D}_{\mathrm{KL}}(p_{\theta}\Vert p)$$` --- # Training Algorithm - An unbiased estimator for `\(\mathcal{D}_{\mathrm{KL}}(p_{\theta}\Vert p)\)` is given by `$$\hat{L}(\theta)=\frac{1}{M}\sum_{i=1}^M \left[ \log p_\theta(X_i)-\log p(X_i)\right],$$` where `\(X_i=T_{\theta}(Z_i),\ Z_i\sim N(0,I)\)` - We further have `$$p_{\theta}(x)=p_0(z)\left|\det\left(\nabla T_{\theta}(z)\right)\right|^{-1},\quad z=T_{\theta}^{-1}(x)$$` --- # Training Algorithm - As a result, `$$\begin{align}\hat{L}(\theta)&=\frac{1}{M}\sum_{i=1}^M \left[ \log p_0(Z_i)-\log|\det\left(\nabla T_{\theta}(Z_i)\right)|-\log p(X_i)\right]\\ &=-\frac{1}{M}\sum_{i=1}^M \left[ \log|\det\left(\nabla T_{\theta}(Z_i)\right)|+\log p(T_{\theta}(Z_i))\right]+C\end{align}$$` - Therefore, `$$\begin{align}\mathbb{E}_Z \hat{L}(\theta)&=\mathcal{D}_{\mathrm{KL}}(p_{\theta}\Vert p)\\ \mathbb{E}_Z \left(\nabla_\theta\hat{L}(\theta)\right)&=\nabla_\theta\mathcal{D}_{\mathrm{KL}}(p_{\theta}\Vert p)\end{align}$$` --- # Stochastic Gradient Descent We can then use stochastic gradient descent to optimize `\(\mathcal{D}_{\mathrm{KL}}(p_{\theta}\Vert p)\)`. In each iteration: - Simulate `\(Z_i\overset{iid}{\sim} N(0,I)\)` - Define `$$\hat{\ell}(\theta)=-\frac{1}{M}\sum_{i=1}^M \left[ \log|\det\left(\nabla T_{\theta}(Z_i)\right)|+\log p(T_{\theta}(Z_i))\right]$$` - Compute `\(g(\theta)=\nabla_{\theta}\hat{\ell}(\theta)\)` - Update `\(\theta \leftarrow \theta-\alpha\cdot g(\theta)\)` --- # Unnormalized Target Density - In fact, the training algorithm is also valid for .highlight[unnormalized] target densities. - Assume `\(p(x)\propto e^{-E(x)}\)` - Then just change `\(\hat{\ell}(\theta)\)` to `$$\hat{\ell}(\theta)=-\frac{1}{M}\sum_{i=1}^M \left[ \log|\det\left(\nabla T_{\theta}(Z_i)\right)|-E(T_{\theta}(Z_i))\right]$$` - This is because the unknown constant does not affect the gradient --- # Demonstration <div style="float:left; width:60%;"> <video controls autoplay loop src="images/density.mp4"></video> </div> <div style="float:right; width:35%;"> <video controls autoplay loop src="images/transformation.mp4"></video> </div> --- class: inverse, middle, center # Modeling Transport Maps --- # Modeling Transport Maps For practical use, the transport map `\(T_{\theta}\)` needs to have some "nice" properties: - `\(T_{\theta}\)` should be .highlight[invertible] and .highlight[differentiable] - `\(T_{\theta}^{-1}\)` and `\(\det(\nabla T_{\theta})\)` should be easy to compute - `\(T_{\theta}\)` should be flexible enough to characterize sophisticated nonlinear mappings At first glance, these are quite strong conditions. --- # Modeling Transport Maps - Polynomials <sup><span class="small">[2]</span></sup> (not good enough) - Normalizing flows <sup><span class="small">[3]</span></sup> (tools from the deep learning community) --- # Normalizing Flow - Normalizing flows (NFs) are a class of diffeomorphisms constructed by neural networks - NFs define invertible mappings through .highlight[composition]: `$$T=T_K\circ\cdots\circ T_1$$` - Each `\(T_i\)` is a simpler diffeomorphism --- # Change-of-Variable Revisited - Recall the density formula for `\(X=T(Z)\)`: `$$p_X(x)=p_Z(z)\left|\det\left(\nabla T(z)\right)\right|^{-1},\quad z=T^{-1}(x)$$` - For `\(T=T_K\circ\cdots\circ T_1\)`, let `\(z_0=z\)`, `\(z_K=x\)`, and `\(z_i=T_i(z_{i-1})\)`, then `$$\log\left|\det\left(\nabla T(z)\right)\right|=\sum_{i=1}^K \log\left|\det\left(\nabla T_i(z_{i-1})\right)\right|$$` - So we still obtain an explicit form for `\(p_X(x)\)` --- # Implementations Many different implementations of NFs <sup><span class="small">[4,5]</span></sup>: - Permutation and orthogonal - Decomposition-based - Planar and radial - Coupling - Autoregressive - ... --- # Example: Real NVP - One popular implementation of NF is the affine coupling flow, also called Real NVP <sup><span class="small">[6]</span></sup> - Given `\(X\in\mathbb{R}^d\)`, define `\(Y=T(X)\in\mathbb{R}^d\)` as below: `$$\begin{align}Y_{1:r}&=X_{1:r}\\Y_{(r+1):d}&=\mu(X_{1:r})+\sigma(X_{1:r})\odot X_{(r+1):d}\end{align}$$` - `\(X_{1:r}\)` is the first `\(r\)` elements of `\(X\)`, `\(r<d\)` - `\(\odot\)` is elementwise multiplication - `\(\mu,\sigma:\mathbb{R}^r\rightarrow\mathbb{R}^{d-r}\)` are two neural networks, `\(\sigma(\cdot)>0\)` --- # Why Real NVP Works Consider the composition of two transformations: `$$\require{color}\begin{align*} \left(\begin{matrix}{\color{red}X_{1}}\\ {\color{red}X_{2}}\\ X_{3}\\ X_{4} \end{matrix}\right) & \overset{T_{1}}{\Rightarrow}\left(\begin{matrix}{\color{red}Y_{1}}=X_{1}\\ {\color{red}Y_{2}}=X_{2}\\ Y_{3}=\mu_{Y_{3}}(X_{1},X_{2})+\sigma_{Y_{3}}(X_{1},X_{2})\cdot X_{3}\\ Y_{4}=\mu_{Y_{4}}(X_{1},X_{2})+\sigma_{Y_{4}}(X_{1},X_{2})\cdot X_{4} \end{matrix}\right)\\ \left(\begin{matrix}Y_{1}\\ Y_{2}\\ {\color{blue}Y_{3}}\\ {\color{blue}Y_{4}} \end{matrix}\right) & \overset{T_{2}}{\Rightarrow}\left(\begin{matrix}Z_{1}=\mu_{Z_{1}}(Y_{3},Y_{4})+\sigma_{Z_{1}}(Y_{3},Y_{4})\cdot Y_{1}\\ Z_{2}=\mu_{Z_{2}}(Y_{3},Y_{4})+\sigma_{Z_{2}}(Y_{3},Y_{4})\cdot Y_{2}\\ {\color{blue}Z_{3}}=Y_{3}\\ {\color{blue}Z_{4}}=Y_{4} \end{matrix}\right) \end{align*}$$` --- # Why Real NVP Works `\(T_1\)` can be easily inverted (similar for `\(T_2\)`): `$$\require{color}\begin{align*} \left(\begin{matrix}{\color{red}X_{1}}\\ {\color{red}X_{2}}\\ X_{3}\\ X_{4} \end{matrix}\right)\overset{T_1}{\Rightarrow}\left(\begin{matrix}{\color{red}Y_{1}}=X_{1}\\ {\color{red}Y_{2}}=X_{2}\\ Y_{3}=\mu_{Y_{3}}(X_{1},X_{2})+\sigma_{Y_{3}}(X_{1},X_{2})\cdot X_{3}\\ Y_{4}=\mu_{Y_{4}}(X_{1},X_{2})+\sigma_{Y_{4}}(X_{1},X_{2})\cdot X_{4} \end{matrix}\right)\\ \left(\begin{matrix}{\color{red}X_{1}}=Y_{1}\\ {\color{red}X_{2}}=Y_{2}\\ X_{3}=\left[Y_{3}-\mu_{Y_{3}}(Y_{1},Y_{2})\right]/\sigma_{Y_{3}}(Y_{1},Y_{2})\\ X_{4}=\left[Y_{4}-\mu_{Y_{4}}(Y_{1},Y_{2})\right]/\sigma_{Y_{4}}(Y_{1},Y_{2}) \end{matrix}\right)\overset{T_1^{-1}}{\Leftarrow}\left(\begin{matrix}{\color{red}Y_{1}}\\ {\color{red}Y_{2}}\\ Y_{3}\\ Y_{4} \end{matrix}\right) \end{align*}$$` --- # Why Real NVP Works `\(\nabla T_1\)` is lower triangular: `$$\require{color}\begin{align*} \left(\begin{matrix}{\color{red}X_{1}}\\ {\color{red}X_{2}}\\ X_{3}\\ X_{4} \end{matrix}\right) & \overset{T_1}{\Rightarrow}\left(\begin{matrix}{\color{red}Y_{1}}=X_{1}\\ {\color{red}Y_{2}}=X_{2}\\ Y_{3}=\mu_{Y_{3}}(X_{1},X_{2})+\sigma_{Y_{3}}(X_{1},X_{2})\cdot X_{3}\\ Y_{4}=\mu_{Y_{4}}(X_{1},X_{2})+\sigma_{Y_{4}}(X_{1},X_{2})\cdot X_{4} \end{matrix}\right)\\ \nabla T_1(x) & =\left(\begin{array}{cccc} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\* & * & \sigma_{Y_{3}}(x_{1},x_{2}) & 0\\* & * & 0 & \sigma_{Y_{4}}(x_{1},x_{2}) \end{array}\right) \end{align*}$$` `\(\require{color}\color{deeppink}\det(\nabla T_1)\)` .highlight[can be computed in linear time (the product of diagonal elements).] --- # Why Real NVP Works Some theoretical works such as [7] show that Real NVP flows are universal approximators. .highlight[As a result, all three conditions are satisfied.] --- # Parameterization - Construct the transport map `\(T_\theta\)` using normalizing flows - The vector `\(\theta\)` contains all the neural network parameters - `\(\nabla_{\theta}\hat{\ell}(\theta)\)` is typically computed using automatic differentiation --- # Extensions Recent literature such as [8] shows that the measure transport sampler based on the KL loss function does not learn multimodal distributions well. Some improved samplers exist, and more is yet to be explored. --- # References .medium[ [1] https://artowen.su.domains/mc/Ch-var-is.pdf [2] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini (2016). An introduction to sampling via measure transport. arXiv:1602.05023. [3] Matthew Hoffman et al. (2019). Neutra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704. [4] George Papamakarios et al. (2021). Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research. [5] Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker (2020). Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence. ] --- # References .medium[ [6] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803. [7] Takeshi Teshima et al. (2020). Coupling-based invertible neural networks are universal diffeomorphism approximators. Advances in Neural Information Processing Systems 33. [8] Yixuan Qiu and Xiao Wang (2023). Efficient multimodal sampling via tempered distribution flow. Journal of the American Statistical Association. ]