class: center, middle, inverse, title-slide

.title[
# Computational Statistics
]
.subtitle[
## Lecture 11
]
.author[
### Yixuan Qiu
]
.date[
### 2022-11-30
]

---
class: inverse, center, middle

# Simulation and Sampling

---

# Today's Topics

- Importance sampling

- Measure transport sampler

---
class: inverse, center, middle

# Importance Sampling

---

# Importance Sampling

Strictly speaking, importance sampling (IS) is not a method to obtain sample points `\(X_1,\ldots,X_n\)` that follow the target distribution `\(p(x)\)`.

Instead, it is a technique to estimate expectations related to `\(p(x)\)`:

`$$\mu=\mathbb{E}_X[f(X)]=\int f(x)p(x)\mathrm{d}x,\quad X\sim p(x).$$`

---

# A Direct Solution

Of course, one direct method to approximate `\(\mu\)` is to generate `\(X_1,\ldots,X_M\sim p(x)\)`, and then an unbiased estimator for `\(\mu\)` is given by

`$$\hat{\mu}=\frac{1}{M}\sum_{i=1}^M f(X_i),\quad X_i\sim p(x),\ i=1,\ldots,M.$$`

Suppose that we use rejection sampling to get `\(X_i\)` based on a proposal distribution `\(q(x)\)`. There are two issues here:

- Rejection sampling discards sample points

- `\(f(x)\)` may be close to zero outside a region `\(A\)` for which `\(P(X\in A)\)` is small, so most sample points contribute little to the estimate

---

# Example

We want to estimate `\(p=P(X>\pi)\)` and `\(\mu=\mathbb{E}(X|X>\pi)\)`, where `\(X\sim N(0,1)\)`.

Naive solution: sample `\(X_1,\ldots,X_M\overset{iid}{\sim}N(0,1)\)`, and get

`$$\begin{align*} \hat{p} & =\frac{1}{M}\sum_{i=1}^{M}I\{X_{i}>\pi\},\\ \hat{\mu} & =\frac{\sum_{i=1}^{M}X_{i}\cdot I\{X_{i}>\pi\}}{\sum_{i=1}^{M}I\{X_{i}>\pi\}}. \end{align*}$$`

Problem: `\(p\)` is very small (true value ~0.00084), so quite possibly none of the `\(X_i\)`'s exceeds `\(\pi\)`.

---

# Example

```r
est_naive = function(n) {
  x = rnorm(n)
  p_hat = mean(x > pi)
  mu_hat = sum(x * (x > pi)) / sum(x > pi)
  c(p_hat, mu_hat)
}

set.seed(123)
est_naive(n = 100)
```

```
## [1] 0 NaN
```

---

# Motivation

IS attempts to resolve the previous issues:

- It does not discard or waste any sample points

- Instead, it assigns different weights to each point

- By properly choosing the proposal distribution, it is able to more effectively generate sample points around the "important region" `\(A\)`

---

# Basic Idea

The idea of IS is in fact quite simple. It is based on a straightforward identity:

`$$\mu=\int f(x)p(x)\mathrm{d}x=\int\frac{f(x)p(x)}{q(x)}q(x)\mathrm{d}x=\mathbb{E}_{q}\left(\frac{f(X)p(X)}{q(X)}\right),$$`

where `\(q(x)\)` is another density function that is positive on `\(\mathbb{R}^p\)`, and `\(\mathbb{E}_q(\cdot)\)` denotes the expectation for `\(X\sim q(x)\)`.

Accordingly, the IS estimate for `\(\mu\)` is

`$$\require{color}\hat{\mu}_q=\frac{1}{M}\sum_{i=1}^M \frac{f(X_i)p(X_i)}{q(X_i)},\quad X_i\sim \textcolor{deeppink}{q(x)},\ i=1,\ldots,M.$$`

---

# Theorem <sup><span class="small">[2]</span></sup>

Suppose that `\(q(x)>0\)` whenever `\(f(x)p(x)\neq 0\)`. Then `\(\mathbb{E}_q(\hat{\mu}_q)=\mu\)`, and `\(\mathrm{Var}_q(\hat{\mu}_q)=\sigma_q^2/M\)`, where

`$$\sigma_{q}^{2}=\int_{\mathcal{Q}}\frac{[f(x)p(x)]^{2}}{q(x)}\mathrm{d}x-\mu^{2}=\int_{\mathcal{Q}}\frac{[f(x)p(x)-\mu q(x)]^{2}}{q(x)}\mathrm{d}x,$$`

and `\(\mathcal{Q}=\{x:q(x)>0\}\)`.
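---

# Theorem: A Numerical Check

A minimal sketch of the theorem, assuming the toy choices `\(f(x)=x^2\)`, `\(p=N(0,1)\)` (so `\(\mu=\mathbb{E}_p[X^2]=1\)`), and proposal `\(q=N(0,2^2)\)`; it compares the simulated mean and variance of `\(\hat{\mu}_q\)` with the formulas above.

```r
# Toy choices (assumed for illustration): f(x) = x^2, p = N(0, 1), q = N(0, 2^2)
est_check = function(M) {
  x = rnorm(M, sd = 2)                    # X_i ~ q
  w = dnorm(x) / dnorm(x, sd = 2)         # importance weights p(X_i) / q(X_i)
  mean(x^2 * w)                           # IS estimate of mu = E_p[X^2] = 1
}

set.seed(123)
est = replicate(1000, est_check(M = 1000))
c(mean(est), var(est))    # mean should be near mu = 1 (unbiasedness)

# sigma_q^2 via numerical integration; var(est) should be near sigma_q^2 / M
sigma2 = integrate(function(x) (x^2 * dnorm(x))^2 / dnorm(x, sd = 2),
                   lower = -Inf, upper = Inf)$value - 1
sigma2 / 1000
```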
---

# Example

Back to the previous example, note that

`$$\begin{align*} p & =\int_{-\infty}^{+\infty}I(x>\pi)\phi(x)\mathrm{d}x\\ & =\int_{\pi}^{+\infty}\frac{I(x>\pi)\phi(x)}{q(x)}q(x)\mathrm{d}x=\int_{\pi}^{+\infty}\frac{\phi(x)}{q(x)}q(x)\mathrm{d}x,\\ q(x) & =\begin{cases} 0, & x\le\pi\\ e^{-(x-\pi)}, & x>\pi \end{cases}, \end{align*}$$`

where `\(q(x)\)` is the density of an exponential distribution shifted to start at `\(\pi\)`. Similarly,

`$$\mu=p^{-1}\int_{\pi}^{+\infty}\frac{x\cdot\phi(x)}{q(x)}q(x)\mathrm{d}x.$$`

---

# Example

```r
est_is = function(n) {
  x = rexp(n) + pi
  ratio = exp(dnorm(x, log = TRUE) + x - pi)
  p_hat = mean(ratio)
  mu_hat = mean(x * ratio) / p_hat
  c(p_hat, mu_hat)
}

set.seed(123)
est_is(n = 100)
```

```
## [1] 0.0007478989 3.4296517837
```

---

# Optimal `\(q(x)\)` <sup><span class="small">[2]</span></sup>

It can be proved that the optimal proposal distribution `\(q^*(x)\)` is given by `\(q^*(x)=|f(x)|p(x)/\mathbb{E}_p (|f(X)|)\)`.

Proof: for any density function `\(q(x)\)` such that `\(q(x)>0\)` when `\(f(x)p(x)\neq 0\)`,

`$$\begin{align*} \mu^{2}+\sigma_{q^{*}}^{2} & =\int\frac{[f(x)p(x)]^{2}}{q^{*}(x)}\mathrm{d}x=[\mathbb{E}_{p}(|f(X)|)]^{2}\\ & =\left[\mathbb{E}_{q}\left(\frac{|f(X)|p(X)}{q(X)}\right)\right]^{2}\\ & \le\mathbb{E}_{q}\left(\frac{[f(X)p(X)]^{2}}{[q(X)]^{2}}\right)=\int\frac{[f(x)p(x)]^{2}}{q(x)}\mathrm{d}x=\mu^{2}+\sigma_{q}^{2}. \end{align*}$$`

---

# Optimal `\(q(x)\)`

This means that to approximate `\(\int f(x)p(x)\mathrm{d}x\)`, IS can be better than the simple Monte Carlo estimator!

In particular, taking `\(q=p\)` in the inequality above shows `\(\sigma_{q^*}^2\le\sigma_p^2\)`, where `\(\sigma_p^2/M\)` is exactly the variance of the simple Monte Carlo estimator `\(\hat{\mu}\)`.

---

# Self-Normalized IS

Suppose that we can only compute `\(p_u(x)\propto p(x)\)` and `\(q_u(x)\propto q(x)\)`. Then the .highlight[self-normalized IS estimate] is given by

`$$\tilde{\mu}=\frac{\sum_{i=1}^{M}f(X_{i})w(X_{i})}{\sum_{i=1}^{M}w(X_{i})},$$`

where `\(w(x)=p_u(x)/q_u(x)\)` and `\(X_i\sim q(x)\)`.

Under mild conditions, `\(\tilde{\mu}\)` is a consistent estimator of `\(\mu\)`, but in general `\(\tilde{\mu}\)` .highlight[is no longer unbiased].

---
class: inverse, center, middle

# Measure Transport Sampler

---

# Recap: Inverse Transform Algorithm

- `\(U\sim Unif(0,1)\)`

- `\(g=F^{-1}\)`

- `\(X=g(U)\Rightarrow X\sim f(x)\)`

- `\(f(x)=F'(x)\)` is the density function

---

# Random Variable Transformation

- Continuous random variable `\(X\sim p_X(x)\)`

- `\(p_X(x)\)` density function

- `\(g:\mathbb{R}\rightarrow\mathbb{R}\)` a .highlight[monotone] function

- Define `\(Y=g(X)\)`, then its density function is given by

`$$p_{Y}(y)=p_{X}(g^{-1}(y))\left|\frac{\mathrm{d}}{\mathrm{d}y}g^{-1}(y)\right|$$`

- Can extend to multivariate case

---

# Random Vector Transformation

- Continuous random vector `\(X\in\mathbb{R}^d\)`, `\(X\sim p_X(x)\)`

- `\(p_X(x)\)` density function

- `\(T:\mathbb{R}^d\rightarrow\mathbb{R}^d\)` a .highlight[diffeomorphism] (a smooth mapping with smooth inverse)

- Define `\(Y=T(X)\)`, then its density function is given by

`$$\begin{align*} p_{Y}(y) & =p_{X}(T^{-1}(y))\left|\det\left(\nabla(T^{-1})(y)\right)\right|\\ & =p_{X}(x)\left|\det\left(\nabla T(x)\right)\right|^{-1},\quad x=T^{-1}(y) \end{align*}$$`

- `\(\nabla T\)` (resp. `\(\nabla T^{-1}\)`) is the Jacobian matrix of `\(T\)` (resp. `\(T^{-1}\)`)

- A quick numerical check of this formula is sketched on the next slide
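---

# Random Vector Transformation: Numerical Check

A minimal sketch of the formula above, assuming the toy choices `\(X\sim N(0,I_2)\)` and the linear diffeomorphism `\(T(x)=Ax\)` with an (arbitrarily chosen) invertible matrix `\(A\)`; since `\(Y=T(X)\sim N(0,AA^\top)\)` here, the formula can be checked against the known Gaussian density.

```r
A = matrix(c(2, 1, 0, 1), nrow = 2)     # an (assumed) invertible 2 x 2 matrix

p_X = function(x) exp(-sum(x^2) / 2) / (2 * pi)     # N(0, I_2) density

# Density of Y = T(X) implied by the formula:
# p_Y(y) = p_X(T^{-1}(y)) |det(grad T^{-1}(y))| = p_X(A^{-1} y) / |det(A)|
p_Y = function(y) p_X(solve(A, y)) / abs(det(A))

# Known density of N(0, Sigma) with Sigma = A A'
Sigma = A %*% t(A)
p_Y_true = function(y)
  drop(exp(-0.5 * t(y) %*% solve(Sigma, y))) / (2 * pi * sqrt(det(Sigma)))

y0 = c(1, -0.5)
c(p_Y(y0), p_Y_true(y0))    # the two values should agree
```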
---

# Recall: Box-Muller Transform

1. `\((U_1,U_2)\sim \mathrm{Unif}([0,1]^2)\)`

2. Let `\(Z_1=\sqrt{-2\log(U_1)}\cos(2\pi U_2)\)` and `\(Z_2=\sqrt{-2\log(U_1)}\sin(2\pi U_2)\)`

3. Then `\(Z_1\)` and `\(Z_2\)` are two .highlight[independent] `\(N(0,1)\)` random variables

---

# Recall: Box-Muller Transform

- Transformation mapping

`$$\small T\left(\begin{array}{c} U_{1}\\ U_{2} \end{array}\right)=\left(\begin{array}{c} \sqrt{-2\log(U_{1})}\cos(2\pi U_{2})\\ \sqrt{-2\log(U_{1})}\sin(2\pi U_{2}) \end{array}\right)$$`

- Jacobian matrix

`$$\small \begin{align*} \nabla T\left(\begin{array}{c} U_{1}\\ U_{2} \end{array}\right) & =\left(\begin{array}{cc} \frac{-\cos(\theta)}{U_{1}L} & -2\pi L\sin(\theta)\\ \frac{-\sin(\theta)}{U_{1}L} & 2\pi L\cos(\theta) \end{array}\right)\\ L & =\sqrt{-2\log(U_{1})},\quad\theta=2\pi U_{2} \end{align*}$$`

- Determinant

`$$\small \det\nabla T=-2\pi U_{1}^{-1}$$`

---

# Recall: Box-Muller Transform

- Clearly, `\(p_U(u_1,u_2)=1\)`, so

`$$p_{Z}(z_{1},z_{2})=p_U(u_1,u_2)|\det\nabla T|^{-1}=(2\pi)^{-1}u_1$$`

- To express `\(u_1\)` using `\(z_1\)` and `\(z_2\)`, note that

`$$Z_1^2+Z_2^2=-2\log(U_{1}),\quad\text{so}\quad u_{1}=e^{-\frac{1}{2}(z_{1}^{2}+z_{2}^{2})}$$`

- Therefore,

`$$p_{Z}(z_{1},z_{2})=\frac{1}{2\pi}e^{-\frac{1}{2}(z_{1}^{2}+z_{2}^{2})},$$`

which implies that `\(Z_1\)` and `\(Z_2\)` follow independent `\(N(0,1)\)`

---

# Ideas from Inverse Transform

- Fix a source distribution `\(p_Z(z)\)`, e.g. `\(N(0,I)\)`

- Given a target distribution `\(p_X(x)\)`

- Can we .highlight[learn a mapping] `\(T\)` such that `\(X=T(Z)\sim p_X(x)\)` if `\(Z\sim p_Z(z)\)`?

---

# Measure Transport Sampler

- If we can obtain such a mapping, sampling would be easy:

    - Simulate `\(Z_1,\ldots,Z_n\sim N(0,I)\)`

    - Set `\(X_1=T(Z_1),\ldots,X_n=T(Z_n)\)`

    - Then `\(X_i\overset{iid}{\sim}p_X(x)\)`

- The key is to find such a mapping `\(T\)`, also called a .highlight[transport map] (a toy example with a closed-form map is given in the appendix slide)

---

# Transport Map

For practical use, the transport map `\(T\)` needs to have some "nice" properties:

- `\(T\)` should be invertible and differentiable

- `\(T^{-1}\)` and `\(\det(\nabla T(x))\)` should be easy to compute

- `\(T\)` should be flexible enough to characterize sophisticated nonlinear mappings

---

# Modeling Transport Maps

- Polynomials <sup><span class="small">[3]</span></sup> (often not flexible enough)

- Normalizing flows <sup><span class="small">[4]</span></sup> (tools from the deep learning community)

---

# Training

The map `\(T\)` is estimated by minimizing a statistical divergence (e.g., the Kullback-Leibler divergence) between the distribution of `\(T(Z)\)` and the target `\(p_X(x)\)`. See [3] for details.

---

# References

.medium[

[1] Sheldon M. Ross (2011). Simulation. Academic Press.

[2] Art B. Owen. Importance sampling. https://artowen.su.domains/mc/Ch-var-is.pdf

[3] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini (2016). An introduction to sampling via measure transport. arXiv:1602.05023.

[4] Matthew Hoffman et al. (2019). NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv:1903.03704.

]
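---

# Appendix: A Toy Transport Map

For a univariate target with tractable CDF `\(F_X\)`, a transport map is available in closed form: `\(T=F_X^{-1}\circ\Phi\)`, where `\(\Phi\)` is the standard normal CDF (the inverse transform idea again). The sketch below assumes a `\(\mathrm{Gamma}(2,1)\)` target as a toy illustration; learning `\(T\)` for general targets is the topic of [3] and [4].

```r
set.seed(123)
z = rnorm(10000)                               # source sample Z ~ N(0, 1)

# Closed-form transport map T = F_X^{-1} o Phi for a Gamma(2, 1) target
T_map = function(z) qgamma(pnorm(z), shape = 2, rate = 1)
x = T_map(z)                                   # X = T(Z)

# The transported sample should follow the target distribution
ks.test(x, pgamma, shape = 2, rate = 1)
```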