
Computational Statistics

Lecture 8

Yixuan Qiu

2022-11-02

1 / 35

Optimization

2 / 35

Last Time

  • Dealing with nonsmooth problems of the form \(F(x)=f(x)+h(x)\)

  • What if we have more than one nonsmooth term?

  • This is common in statistical models, e.g., by adding multiple regularization terms/constraints to the model

3 / 35

Today's Topics

  • Douglas-Rachford splitting method

  • Davis-Yin splitting method

  • Proximal-proximal-gradient algorithm

4 / 35

Motivation

  • For the nonsmooth function \(F(x)=f(x)+h(x)\), where \(f(x)\) is smooth and \(h(x)\) is nonsmooth, we have the proximal gradient descent algorithm

$$x^{(k+1)}=\mathbf{prox}_{\alpha h}\left(x^{(k)}-\alpha\cdot\nabla f(x^{(k)})\right),\quad k=0,1,\ldots$$

  • If we let \(f(x)=0\), we get the proximal point algorithm (PPA) for minimizing a nonsmooth function \(h(x)\):

$$x^{(k+1)}=\mathbf{prox}_{\alpha h}(x^{(k)}),\quad k=0,1,\ldots$$
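
As a quick refresher, below is a minimal sketch of the proximal gradient iteration (with \(f=0\) it reduces to PPA). The helper names grad_f and prox_h, with prox_h(z, alpha) evaluating \(\mathbf{prox}_{\alpha h}(z)\), are illustrative assumptions, not from the slides.

```python
import numpy as np

def proximal_gradient(x0, grad_f, prox_h, alpha, n_iter=100):
    """Proximal gradient descent; grad_f = lambda x: 0 recovers the PPA."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # gradient step on the smooth part f, then proximal step on the nonsmooth part h
        x = prox_h(x - alpha * grad_f(x), alpha)
    return x
```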

5 / 35

Motivation

  • Now consider \(F(x)=g(x)+h(x)\), where both \(g(x)\) and \(h(x)\) are nonsmooth

  • Of course, we can try to compute \(\mathbf{prox}_{\alpha (g+h)}\), and then apply PPA

  • Unfortunately, in general \(\mathbf{prox}_{\alpha (g+h)}\) is not equal to \(\mathbf{prox}_{\alpha g}+\mathbf{prox}_{\alpha h}\), or to any other simple combination of \(\mathbf{prox}_{\alpha g}\) and \(\mathbf{prox}_{\alpha h}\)

  • So even if we can compute \(\mathbf{prox}_{\alpha g}\) and \(\mathbf{prox}_{\alpha h}\) individually, there is no obvious way to solve \(\min_x F(x)\)

6 / 35

Douglas-Rachford Splitting

7 / 35

Douglas-Rachford Splitting

The Douglas-Rachford splitting (DRS) algorithm is a useful method to solve the problem

$$\min_x\ F(x):= \min_x\ g(x)+h(x),$$

where \(g(x)\) and \(h(x)\) are convex functions, possibly nonsmooth.

The algorithm relies on \(\mathbf{prox}_{\alpha g}\) and \(\mathbf{prox}_{\alpha h}\).

8 / 35

DRS Algorithm

DRS uses the following iteration scheme: given an initial value \(y^{(0)}\), for \(k=0,1,\ldots\), iterate

$$\begin{align*} x^{(k+1)} & =\mathbf{prox}_{\alpha g}(y^{(k)})\\ y^{(k+1)} & =y^{(k)}+\mathbf{prox}_{\alpha h}(2x^{(k+1)}-y^{(k)})-x^{(k+1)} \end{align*}$$

  • The roles of \(g(x)\) and \(h(x)\) are not symmetric

  • The step size \(\alpha>0\) can be chosen arbitrarily, but it may affect the convergence speed
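
A minimal sketch of the DRS iteration, assuming prox_g(z, alpha) and prox_h(z, alpha) are user-supplied callables evaluating \(\mathbf{prox}_{\alpha g}(z)\) and \(\mathbf{prox}_{\alpha h}(z)\) (illustrative names, not part of the slides):

```python
import numpy as np

def drs(y0, prox_g, prox_h, alpha, n_iter=500):
    """Douglas-Rachford splitting for min_x g(x) + h(x)."""
    y = np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        x = prox_g(y, alpha)                  # x^{(k+1)} = prox_{alpha g}(y^{(k)})
        y = y + prox_h(2 * x - y, alpha) - x  # y^{(k+1)} update
    return prox_g(y, alpha)                   # x^* = prox_{alpha g}(y^*), see the convergence slides
```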

9 / 35

Convergence Property [1,4]

We first present the convergence property of the \(y^{(k)}\) sequence. Define $$T(y)=2\mathbf{prox}_{\alpha h}(2\mathbf{prox}_{\alpha g}(y)-y)-2\mathbf{prox}_{\alpha g}(y)+y,$$ and then it is easy to see that \(y^{(k+1)}=(y^{(k)}+T(y^{(k)}))/2\).

  • \(y^{(k)}\) converges to some fixed point \(y^*\) of \(T(\cdot)\), i.e., \(T(y^*)=y^*\).

  • \(\Vert y^{(k)}-y^* \Vert\) is monotonically nonincreasing.

  • \(\Vert y^{(k+1)}-y^{(k)} \Vert=\Vert T(y^{(k)})-y^{(k)} \Vert/2\) is monotonically nonincreasing and converges to 0.

10 / 35

Convergence Property [1,4]

(continued from last slide)

  • We have the asymptotic rate \(\Vert y^{(k+1)}-y^{(k)}\Vert^2=o(1/k).\)

  • Nonasymptotic rate $$\Vert y^{(k+1)}-y^{(k)}\Vert^2\le\frac{\Vert y^{(0)}-y^* \Vert^2}{k+1}.$$

11 / 35

Convergence Property [1,4]

\(y^*\) is connected with the optimization problem \(\min_x\ g(x)+h(x)\) via the following important conclusion:

If \(y^*\) is a point such that \(T(y^*)=y^*\), then \(x^*=\mathbf{prox}_{\alpha g}(y^*)\) is an optimal point of \(\min_x\ g(x)+h(x)\).

It has also been proved that \(x^{(k)}\) converges to some optimal point of \(\min_x\ g(x)+h(x)\).

Convergence rates will be introduced in the Davis-Yin splitting algorithm, which is a generalization of DRS.

12 / 35

Example - \(P_{C\cap D}(u)\)

Problem: given two closed convex sets \(C\) and \(D\), \(C\cap D\neq \varnothing\), compute the projection operator \(P_{C\cap D}(u)\).

In many cases \(P_C\) and \(P_D\) are simple to compute individually, but projecting onto \(C\cap D\) directly may be complicated.

13 / 35

Example - \(P_{C\cap D}(u)\)

The optimization problem becomes

$$\begin{align*} \min_{x} & \ \frac{1}{2}\Vert x-u\Vert^{2}\\ \text{s.t.} & \ x\in C,\ x\in D. \end{align*}$$

Or equivalently,

$$\min_{x}\ \frac{1}{2}\Vert x-u\Vert^{2}+I_{C}(x)+I_{D}(x),$$ where \(I_C(x)=0\) if \(x\in C\), and \(I_C(x)=\infty\) if \(x\notin C\).

14 / 35

Example - \(P_{C\cap D}(u)\)

Now let \(g(x)=\frac{1}{2}\Vert x-u\Vert^{2}+I_{C}(x)\), and then

$$\small\begin{align*} \mathbf{prox}_{\alpha g}(z) & =\underset{x}{\arg\min}\ \frac{1}{2}\Vert x-u\Vert^{2}+I_{C}(x)+\frac{1}{2\alpha}\Vert x-z\Vert^{2}\\ & =\underset{x}{\arg\min}\ \frac{(\alpha+1)\Vert x\Vert^{2}-2x'(\alpha u+z)}{2\alpha}+I_{C}(x)\\ & =\underset{x\in C}{\arg\min}\ \Vert x-(\alpha+1)^{-1}(\alpha u+z)\Vert^{2}\\ & =P_{C}\left((\alpha+1)^{-1}(\alpha u+z)\right). \end{align*}$$

Also, let \(h(x)=I_{D}(x)\), so that \(\mathbf{prox}_{\alpha h}(z)=P_D(z)\). Then proceed with the DRS algorithm.
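
A minimal sketch under an illustrative choice of sets (not from the slides): \(C\) the nonnegative orthant and \(D\) the unit Euclidean ball, both with simple projections.

```python
import numpy as np

# Illustrative sets: C = nonnegative orthant, D = unit Euclidean ball
P_C = lambda z: np.maximum(z, 0.0)
P_D = lambda z: z / max(1.0, np.linalg.norm(z))

def project_intersection(u, alpha=1.0, n_iter=500):
    """Compute P_{C ∩ D}(u) with DRS, using the prox formulas derived above."""
    u = np.asarray(u, dtype=float)
    prox_g = lambda z: P_C((alpha * u + z) / (alpha + 1.0))  # prox of 0.5||x-u||^2 + I_C
    prox_h = lambda z: P_D(z)                                # prox of I_D
    y = u.copy()
    for _ in range(n_iter):
        x = prox_g(y)
        y = y + prox_h(2 * x - y) - x
    return prox_g(y)

print(project_intersection(np.array([2.0, -1.0, 0.5])))
```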

15 / 35

Davis-Yin Splitting

16 / 35

Motivation

  • Suppose that \(f(x)\) is a smooth convex function, and \(g(x)\) and \(h(x)\) are possibly nonsmooth convex functions

  • Recall that proximal gradient descent minimizes \(f(x)+h(x)\)

  • DRS algorithm minimizes \(g(x)+h(x)\)

  • To unify the above two, we want to find an algorithm to minimize \(F(x)=f(x)+g(x)+h(x)\)

  • Later we will also see that such a "three-operator" problem is the key to handling the sum of an arbitrary number of functions

17 / 35

Davis-Yin Splitting

Consider the optimization problem

$$\min_x\ F(x):=\min_x\ f(x)+g(x)+h(x),$$

where \(f(x)\) is convex and \(L\)-smooth, and \(g(x)\) and \(h(x)\) are possibly nonsmooth convex functions.

The Davis-Yin splitting (DYS) algorithm uses the following iteration scheme: given an initial value \(y^{(0)}\) and a step size \(0<\alpha<2/L\), for \(k=0,1,\ldots\), iterate

$$\begin{align*} x^{(k+1)} & =\mathbf{prox}_{\alpha g}(y^{(k)})\\ y^{(k+1)} & =y^{(k)}+\mathbf{prox}_{\alpha h}(2x^{(k+1)}-y^{(k)}-\alpha\nabla f(x^{(k+1)}))-x^{(k+1)} \end{align*}$$
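
A minimal sketch of the DYS iteration, with the same illustrative prox_g/prox_h helpers as before plus a gradient oracle grad_f (again assumed names, not from the slides):

```python
import numpy as np

def dys(y0, grad_f, prox_g, prox_h, alpha, n_iter=500):
    """Davis-Yin splitting for min_x f(x) + g(x) + h(x), with 0 < alpha < 2/L."""
    y = np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        x = prox_g(y, alpha)                                      # x^{(k+1)}
        y = y + prox_h(2 * x - y - alpha * grad_f(x), alpha) - x  # y^{(k+1)}
    return prox_g(y, alpha)
```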

18 / 35

Some Remarks

  • For the smooth component \(f(x)\), we compute its gradient \(\nabla f(x)\)

  • For the nonsmooth terms \(g(x)\) and \(h(x)\), we use their proximal operators

  • The algorithm is similar to DRS, with an additional gradient descent term

19 / 35

Convergence Property [2]

Define $$\small T(y)=\mathbf{prox}_{\alpha h}(2\mathbf{prox}_{\alpha g}(y)-y-\alpha\nabla f(\mathbf{prox}_{\alpha g}(y)))-\mathbf{prox}_{\alpha g}(y)+y,$$ and then \(y^{(k+1)}=T(y^{(k)})\).

Similar to DRS, the following properties of the \(y^{(k)}\) sequence hold:

  • \(y^{(k)}\) converges to some fixed point \(y^*\) of \(T(\cdot)\).

  • \(\Vert y^{(k)}-y^* \Vert\) is monotonically nonincreasing.

  • \(\Vert y^{(k+1)}-y^{(k)} \Vert=\Vert T(y^{(k)})-y^{(k)} \Vert\) is monotonically nonincreasing and converges to 0.

20 / 35

Convergence Property [2]

The following convergence result is on the \(x^{(k)}\) variables.

If \(h(x)\) is Lipschitz continuous on the closed ball \(B(0,(1+\alpha L)\Vert y^{(0)}-y^*\Vert)\), then

$$(f+g+h)(x^{(k)})-(f+g+h)(x^*)=o\left(\frac{1}{\sqrt{k+1}}\right).$$

This rate does not look much better than that of a subgradient method, but it turns out that by properly averaging the iterates \(x^{(k)}\) we can obtain a faster rate.

21 / 35

Convergence Property [2]

Let $$\bar{x}^{(k)}=\frac{2}{(k+1)(k+2)}\sum_{i=0}^k (i+1)x^{(i)},$$ and then

$$(f+g+h)(\bar{x}^{(k)})-(f+g+h)(x^*)=O\left(\frac{1}{k+1}\right).$$
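
A minimal sketch of computing this weighted (ergodic) average from stored iterates; the array layout (one iterate per row) is an assumption for illustration.

```python
import numpy as np

def ergodic_average(xs):
    """x_bar^{(k)} = 2/((k+1)(k+2)) * sum_{i=0}^{k} (i+1) x^{(i)}."""
    xs = np.asarray(xs, dtype=float)             # shape (k+1, d): iterates x^{(0)}, ..., x^{(k)}
    k = xs.shape[0] - 1
    weights = np.arange(1, k + 2, dtype=float)   # weight i + 1 for i = 0, ..., k
    return 2.0 * (weights[:, None] * xs).sum(axis=0) / ((k + 1) * (k + 2))
```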

22 / 35

Accelerations

  • There exist several accelerated variants of the DYS algorithm under stronger assumptions

  • See [2] for details

23 / 35

Proximal-Proximal-Gradient Algorithm

24 / 35

Further Extensions

  • DYS has given an elegant solution to the nonsmooth convex optimization problem \(\min_x\ f(x)+g(x)+h(x)\)

  • But what if we have more than three components?

  • For smooth components, this is easy:

  • The gradients are additive, so if we have smooth components \(f_1(x),\ldots,f_m(x)\), then just let \(f=f_1+\cdots+f_m\), and hence \(\nabla f=\nabla f_1+\cdots+\nabla f_m\)

  • Directly apply DYS as usual (of course, the smoothness parameter \(L\) may change, which affects the step size \(\alpha\))

25 / 35

The Consensus Trick

  • Proximal operators are in general not additive, but we can use the "consensus trick".

  • Suppose we want to minimize \(F(x)=f(x)+\sum_{i=1}^m h_i(x)\), where \(f(x)\) is smooth and \(h_i(x)\) may be nonsmooth. Then we find that

$$\require{color}\begin{align*} x^{*} & \in\underset{x}{\arg\min}\ f(x)+\sum_{i=1}^{m}h_{i}(x)\\ \Leftrightarrow & \ (x^{*},\ldots,x^{*})\in\underset{\substack{x_{(1)},\ldots,x_{(m)}\\ x_{(1)}=\cdots=x_{(m)} } }{\arg\min}\ f(\textcolor{deeppink}{\bar{x}})+\sum_{i=1}^{m}h_{i}(\textcolor{deeppink}{x_{(i)}}), \end{align*}$$ where \(\bar{x}=m^{-1}\sum_{i=1}^m x_{(i)}\).

26 / 35

The Consensus Trick

Therefore, if we want \(x^*\in\mathbb{R}^d\), then we can work on a "stacked" variable \(\mathbf{x}=(x_{(1)},\ldots,x_{(m)})\in\mathbb{R}^{md}\), and optimize the function \(\tilde{F}(\mathbf{x}):=f(\bar{\mathbf{x}})+I_{C}(\mathbf{x})+\tilde{h}(\mathbf{x})\), where

  • \(\bar{\mathbf{x}}=m^{-1}\sum_{i=1}^m x_{(i)}\)
  • \(C=\{\mathbf{x}:x_{(1)}=\cdots=x_{(m)}\}\)
  • \(\tilde{h}(\mathbf{x})=\sum_{i=1}^m h_i(x_{(i)})\)

We have shown that an optimal point of \(\tilde{F}(\mathbf{x})\) is \((x^{*},\ldots,x^{*})\), where \(x^*\) is an optimal point of the original problem.

27 / 35

Proximal Operators

  • More importantly, we can show that

  • \(\mathbf{prox}_{\alpha I_{C}}(\mathbf{x})=(\bar{\mathbf{x}},\ldots,\bar{\mathbf{x}})\)

  • \(\mathbf{prox}_{\alpha\tilde{h}}(\mathbf{x})=(\mathbf{prox}_{\alpha h_{1}}(x_{(1)}),\ldots,\mathbf{prox}_{\alpha h_{m}}(x_{(m)}))\)

  • This means that we only need to evaluate \(\require{color}\color{deeppink}\mathbf{prox}_{\alpha h_{i}}(\cdot)\) individually!

  • This is essentially the key idea of the proximal-proximal-gradient (PPG) algorithm
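
A minimal sketch of the two proximal operators on the stacked variable, storing \(\mathbf{x}\) as an \(m\times d\) array; prox_h_list (a list of callables for the individual \(\mathbf{prox}_{\alpha h_i}\)) is an illustrative assumption.

```python
import numpy as np

def prox_consensus(X):
    """prox_{alpha I_C}: replace every copy x_(i) by the average x_bar (alpha plays no role)."""
    return np.tile(X.mean(axis=0), (X.shape[0], 1))

def prox_separable(X, prox_h_list, alpha):
    """prox_{alpha h_tilde}: apply prox_{alpha h_i} to the i-th copy, row by row."""
    return np.vstack([prox_h_list[i](X[i], alpha) for i in range(X.shape[0])])
```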

28 / 35

PPG Algorithm

Consider the optimization problem

$$\min_{x\in\mathbb{R}^d} F(x):=\min_{x\in\mathbb{R}^d}\ r(x)+\frac{1}{n}\sum_{i=1}^{n}(f_{i}(x)+g_{i}(x))$$

  • \(r(x)\), \(f_i(x)\), and \(g_i(x)\) are convex functions

  • \(f_i(x)\) are differentiable

  • \(r(x)\) and \(g_i(x)\) have simple proximal operators

  • Generalization of DYS

29 / 35

PPG Algorithm

Given an initial value \(\mathbf{z}^{(0)}=(z_{(1)}^{(0)},\ldots,z_{(n)}^{(0)})\), iterate

$$\small\begin{align*} x^{(k+1/2)} & =\mathbf{prox}_{\alpha r}(\bar{\mathbf{z}}^{(k)})\\ x_{(i)}^{(k+1)} & =\mathbf{prox}_{\alpha g_{i}}\left(2x^{(k+1/2)}-z_{(i)}^{(k)}-\alpha\nabla f_{i}(x^{(k+1/2)})\right),\quad i=1,\ldots,n\\ z_{(i)}^{(k+1)} & =z_{(i)}^{(k)}+x_{(i)}^{(k+1)}-x^{(k+1/2)},\quad i=1,\ldots,n \end{align*}$$

Remarks:

  • \(\mathbf{z}^{(k)}=(z_{(1)}^{(k)},\ldots,z_{(n)}^{(k)})\in\mathbb{R}^{nd}\), subscripts for copies, superscripts for iteration numbers
  • \(\bar{\mathbf{z}}^{(k)}=n^{-1}\sum_{i=1}^{n}z_{(i)}^{(k)}\in\mathbb{R}^{d}\)
  • Updates of \(x_{(i)}^{(k+1)}\) and \(z_{(i)}^{(k+1)}\) can be parallelized across \(i\)
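
A minimal sketch of the PPG iteration, storing the copies \(z_{(1)},\ldots,z_{(n)}\) as rows of an \(n\times d\) array; prox_r, grad_f_list, and prox_g_list are illustrative names for the required oracles.

```python
import numpy as np

def ppg(z0, prox_r, grad_f_list, prox_g_list, alpha, n_iter=500):
    """Proximal-proximal-gradient for min_x r(x) + (1/n) sum_i (f_i(x) + g_i(x))."""
    Z = np.asarray(z0, dtype=float)               # rows are the copies z_(1), ..., z_(n)
    n = Z.shape[0]
    for _ in range(n_iter):
        x_half = prox_r(Z.mean(axis=0), alpha)    # x^{(k+1/2)} = prox_{alpha r}(z_bar)
        X = np.vstack([                           # this loop over i can be parallelized
            prox_g_list[i](2 * x_half - Z[i] - alpha * grad_f_list[i](x_half), alpha)
            for i in range(n)
        ])
        Z = Z + X - x_half                        # z_(i) update; x^{(k+1/2)} broadcast over rows
    return prox_r(Z.mean(axis=0), alpha)
```
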
30 / 35

Convergence Property [3]

We first state the convergence of the \(\mathbf{z}^{(k)}\) variables.

Suppose that \(f_1(x),\ldots,f_n(x)\) are differentiable and \(L\)-smooth. Select a step size \(0<\alpha<3/(2L)\), and denote \(p(\mathbf{z}^{(k)})=\alpha^{-1}(\mathbf{z}^{(k+1)}-\mathbf{z}^{(k)})\). Then \(\Vert p(\mathbf{z}^{(k)})\Vert\rightarrow 0\) monotonically with the rate $$\Vert p(\mathbf{z}^{(k)})\Vert= O(1/\sqrt{k}).$$

31 / 35

Convergence Property [3]

For the \(x\)-variables, we have \(x^{(k+1/2)}\rightarrow x^*\) and \(x_{(i)}^{(k)}\rightarrow x^*\) for all \(i=1,\ldots,n\), where \(x^*\) is an optimal point of \(F(x)\).

If, in addition, \(\bar{g}(x)=n^{-1}\sum_{i=1}^n g_i(x)\) is Lipschitz continuous, then

$$F(x^{(k+1/2)})-F(x^*)=O(1/\sqrt{k}).$$

Finally, let \(x_{avg}^{(k+1/2)}=k^{-1}\sum_{j=1}^k x^{(j+1/2)}\); then under the same assumptions we have

$$F(x_{avg}^{(k+1/2)})-F(x^*)=O(1/k).$$

32 / 35

Accelerations

Other faster convergence rates with stronger assumptions can be found in [3].

33 / 35

Summary

  • We have summarized three important algorithms for nonsmooth optimization problems

  • DRS \(\rightarrow\) DYS \(\rightarrow\) PPG with increasing generality

  • These methods are very useful for statistical models with multiple constraints and/or regularization terms

  • They typically converge much faster than subgradient methods

34 / 35

References

[1] Damek Davis and Wotao Yin (2016). Convergence rate analysis of several splitting schemes. Splitting methods in communication, imaging, science, and engineering.

[2] Damek Davis and Wotao Yin (2017). A three-operator splitting scheme and its optimization applications. Set-valued and variational analysis.

[3] Ernest K. Ryu and Wotao Yin (2019). Proximal-proximal-gradient method. Journal of Computational Mathematics.

[4] Patrick L. Combettes (2004). Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization.

35 / 35
