
Computational Statistics

Lecture 8

Yixuan Qiu

2022-11-02

1 / 35

Optimization

2 / 35

Last Time

  • Dealing with nonsmooth problems of the form \(F(x)=f(x)+h(x)\)

  • What if we have more than one nonsmooth term?

  • This is common in statistical models, e.g., by adding multiple regularization terms/constraints to the model

3 / 35

Today's Topics

  • Douglas-Rachford splitting method

  • Davis-Yin splitting method

  • Proximal-proximal-gradient algorithm

4 / 35

Motivation

  • For the nonsmooth function \(F(x)=f(x)+h(x)\), where \(f(x)\) is smooth and \(h(x)\) is nonsmooth, we have the proximal gradient descent algorithm

$$x^{(k+1)}=\mathbf{prox}_{\alpha h}\left(x^{(k)}-\alpha\cdot\nabla f(x^{(k)})\right),\quad k=0,1,\ldots$$

  • If we let \(f(x)=0\), we get the proximal point algorithm (PPA) for minimizing a nonsmooth function \(h(x)\):

$$x^{(k+1)}=\mathbf{prox}_{\alpha h}(x^{(k)}),\quad k=0,1,\ldots$$
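
As a quick refresher, below is a minimal sketch of the proximal gradient iteration (with \(f=0\) it reduces to PPA). The helper names grad_f and prox_h, with prox_h(z, alpha) evaluating \(\mathbf{prox}_{\alpha h}(z)\), are illustrative assumptions, not from the slides.

```python
import numpy as np

def proximal_gradient(x0, grad_f, prox_h, alpha, n_iter=100):
    """Proximal gradient descent; grad_f = lambda x: 0 recovers the PPA."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # gradient step on the smooth part f, then proximal step on the nonsmooth part h
        x = prox_h(x - alpha * grad_f(x), alpha)
    return x
```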

5 / 35

Motivation

  • Now consider \(F(x)=g(x)+h(x)\), where both \(g(x)\) and \(h(x)\) are nonsmooth

  • Of course, we can try to compute \(\mathbf{prox}_{\alpha (g+h)}\), and then apply PPA

  • Unfortunately, in general \(\mathbf{prox}_{\alpha (g+h)}\) is not equal to \(\mathbf{prox}_{\alpha g}+\mathbf{prox}_{\alpha h}\), or to any other simple combination of \(\mathbf{prox}_{\alpha g}\) and \(\mathbf{prox}_{\alpha h}\)

  • So even if we can compute \(\mathbf{prox}_{\alpha g}\) and \(\mathbf{prox}_{\alpha h}\) individually, there is no obvious way to solve \(\min_x F(x)\)

6 / 35

Douglas-Rachford Splitting

7 / 35

Douglas-Rachford Splitting

The Douglas-Rachford splitting (DRS) algorithm is a useful method to solve the problem

$$\min_x\ F(x):= \min_x\ g(x)+h(x),$$

where \(g(x)\) and \(h(x)\) are convex functions, possibly nonsmooth.

The algorithm relies on \(\mathbf{prox}_{\alpha g}\) and \(\mathbf{prox}_{\alpha h}\).

8 / 35

DRS Algorithm

DRS uses the following iteration scheme: given an initial value \(y^{(0)}\), for \(k=0,1,\ldots\), iterate

$$\begin{align*} x^{(k+1)} & =\mathbf{prox}_{\alpha g}(y^{(k)})\\ y^{(k+1)} & =y^{(k)}+\mathbf{prox}_{\alpha h}(2x^{(k+1)}-y^{(k)})-x^{(k+1)} \end{align*}$$

  • The roles of \(g(x)\) and \(h(x)\) are not symmetric

  • The step size \(\alpha>0\) can be chosen arbitrarily, but it may affect the convergence speed
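
A minimal sketch of the DRS iteration, assuming prox_g(z, alpha) and prox_h(z, alpha) are user-supplied callables evaluating \(\mathbf{prox}_{\alpha g}(z)\) and \(\mathbf{prox}_{\alpha h}(z)\) (illustrative names, not part of the slides):

```python
import numpy as np

def drs(y0, prox_g, prox_h, alpha, n_iter=500):
    """Douglas-Rachford splitting for min_x g(x) + h(x)."""
    y = np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        x = prox_g(y, alpha)                  # x^{(k+1)} = prox_{alpha g}(y^{(k)})
        y = y + prox_h(2 * x - y, alpha) - x  # y^{(k+1)} update
    return prox_g(y, alpha)                   # x^* = prox_{alpha g}(y^*), see the convergence slides
```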

9 / 35

Convergence Property [1,4]

We first present the convergence property of the \(y^{(k)}\) sequence. Define $$T(y)=2\mathbf{prox}_{\alpha h}(2\mathbf{prox}_{\alpha g}(y)-y)-2\mathbf{prox}_{\alpha g}(y)+y,$$ and then it is easy to see that \(y^{(k+1)}=(y^{(k)}+T(y^{(k)}))/2\).

  • \(y^{(k)}\) converges to some fixed point \(y^*\) of \(T(\cdot)\), i.e., \(T(y^*)=y^*\).

  • \(\Vert y^{(k)}-y^* \Vert\) is monotonically nonincreasing.

  • \(\Vert y^{(k+1)}-y^{(k)} \Vert=\Vert T(y^{(k)})-y^{(k)} \Vert/2\) is monotonically nonincreasing and converges to 0.

10 / 35

Convergence Property [1,4]

(continued from last slide)

  • We have the asymptotic rate \(\Vert y^{(k+1)}-y^{(k)}\Vert^2=o(1/k).\)

  • Nonasymptotic rate $$\Vert y^{(k+1)}-y^{(k)}\Vert^2\le\frac{\Vert y^{(0)}-y^* \Vert^2}{k+1}.$$

11 / 35

Convergence Property [1,4]

\(y^*\) is connected with the optimization problem \(\min_x\ g(x)+h(x)\) via the following important conclusion:

If \(y^*\) is a point such that \(T(y^*)=y^*\), then \(x^*=\mathbf{prox}_{\alpha g}(y^*)\) is an optimal point of \(\min_x\ g(x)+h(x)\).

It has also been proved that \(x^{(k)}\) converges to some optimal point of \(\min_x\ g(x)+h(x)\).

Convergence rates will be introduced in the Davis-Yin splitting algorithm, which is a generalization of DRS.

12 / 35

Example - \(P_{C\cap D}(u)\)

Problem: given two closed convex sets \(C\) and \(D\), \(C\cap D\neq \varnothing\), compute the projection operator \(P_{C\cap D}(u)\).

In many cases \(P_C\) and \(P_D\) are simple to compute individually, but projecting onto \(C\cap D\) directly may be complicated.

13 / 35

Example - \(P_{C\cap D}(u)\)

The optimization problem becomes

$$\begin{align*} \min_{x} & \ \frac{1}{2}\Vert x-u\Vert^{2}\\ \text{s.t.} & \ x\in C,\ x\in D. \end{align*}$$

Or equivalently,

$$\min_{x}\ \frac{1}{2}\Vert x-u\Vert^{2}+I_{C}(x)+I_{D}(x),$$ where \(I_C(x)=0\) if \(x\in C\), and \(I_C(x)=\infty\) if \(x\notin C\).

14 / 35

Example - \(P_{C\cap D}(u)\)

Now let \(g(x)=\frac{1}{2}\Vert x-u\Vert^{2}+I_{C}(x)\), and then

$$\small\begin{align*} \mathbf{prox}_{\alpha g}(z) & =\underset{x}{\arg\min}\ \frac{1}{2}\Vert x-u\Vert^{2}+I_{C}(x)+\frac{1}{2\alpha}\Vert x-z\Vert^{2}\\ & =\underset{x}{\arg\min}\ \frac{(\alpha+1)\Vert x\Vert^{2}-2x'(\alpha u+z)}{2\alpha}+I_{C}(x)\\ & =\underset{x\in C}{\arg\min}\ \Vert x-(\alpha+1)^{-1}(\alpha u+z)\Vert^{2}\\ & =P_{C}\left((\alpha+1)^{-1}(\alpha u+z)\right). \end{align*}$$

Also, let \(h(x)=I_{D}(x)\), so that \(\mathbf{prox}_{\alpha h}(z)=P_D(z)\). Then proceed with the DRS algorithm.
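
A minimal sketch under an illustrative choice of sets (not from the slides): \(C\) the nonnegative orthant and \(D\) the unit Euclidean ball, both with simple projections.

```python
import numpy as np

# Illustrative sets: C = nonnegative orthant, D = unit Euclidean ball
P_C = lambda z: np.maximum(z, 0.0)
P_D = lambda z: z / max(1.0, np.linalg.norm(z))

def project_intersection(u, alpha=1.0, n_iter=500):
    """Compute P_{C ∩ D}(u) with DRS, using the prox formulas derived above."""
    u = np.asarray(u, dtype=float)
    prox_g = lambda z: P_C((alpha * u + z) / (alpha + 1.0))  # prox of 0.5||x-u||^2 + I_C
    prox_h = lambda z: P_D(z)                                # prox of I_D
    y = u.copy()
    for _ in range(n_iter):
        x = prox_g(y)
        y = y + prox_h(2 * x - y) - x
    return prox_g(y)

print(project_intersection(np.array([2.0, -1.0, 0.5])))
```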

15 / 35

Davis-Yin Splitting

16 / 35

Motivation

  • Suppose that \(f(x)\) is a smooth convex function, and \(g(x)\) and \(h(x)\) are possibly nonsmooth convex functions

  • Recall that proximal gradient descent minimizes \(f(x)+h(x)\)

  • DRS algorithm minimizes \(g(x)+h(x)\)

  • To unify the above two, we want to find an algorithm to minimize \(F(x)=f(x)+g(x)+h(x)\)

  • Later we will also see that such a "three-operator" problem is the key to handling the sum of an arbitrary number of functions

17 / 35

Davis-Yin Splitting

Consider the optimization problem

$$\min_x\ F(x):=\min_x\ f(x)+g(x)+h(x),$$

where \(f(x)\) is convex and \(L\)-smooth, and \(g(x)\) and \(h(x)\) are possibly nonsmooth convex functions.

The Davis-Yin splitting (DYS) algorithm uses the following iteration scheme: given an initial value \(y^{(0)}\) and a step size \(0<\alpha<2/L\), for \(k=0,1,\ldots\), iterate

$$\begin{align*} x^{(k+1)} & =\mathbf{prox}_{\alpha g}(y^{(k)})\\ y^{(k+1)} & =y^{(k)}+\mathbf{prox}_{\alpha h}(2x^{(k+1)}-y^{(k)}-\alpha\nabla f(x^{(k+1)}))-x^{(k+1)} \end{align*}$$
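
A minimal sketch of the DYS iteration, with the same illustrative prox_g/prox_h helpers as before plus a gradient oracle grad_f (again assumed names, not from the slides):

```python
import numpy as np

def dys(y0, grad_f, prox_g, prox_h, alpha, n_iter=500):
    """Davis-Yin splitting for min_x f(x) + g(x) + h(x), with 0 < alpha < 2/L."""
    y = np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        x = prox_g(y, alpha)                                      # x^{(k+1)}
        y = y + prox_h(2 * x - y - alpha * grad_f(x), alpha) - x  # y^{(k+1)}
    return prox_g(y, alpha)
```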

18 / 35

Some Remarks

  • For the smooth component \(f(x)\), we compute its gradient \(\nabla f(x)\)

  • For the nonsmooth terms \(g(x)\) and \(h(x)\), we use their proximal operators

  • The algorithm is similar to DRS, with an additional gradient descent term

19 / 35

Convergence Property [2]

Define $$\small T(y)=\mathbf{prox}_{\alpha h}(2\mathbf{prox}_{\alpha g}(y)-y-\alpha\nabla f(\mathbf{prox}_{\alpha g}(y)))-\mathbf{prox}_{\alpha g}(y)+y,$$ and then \(y^{(k+1)}=T(y^{(k)})\).

Similar to DRS, the following properties of the \(y^{(k)}\) sequence hold:

  • \(y^{(k)}\) converges to some fixed point \(y^*\) of \(T(\cdot)\).

  • \(\Vert y^{(k)}-y^* \Vert\) is monotonically nonincreasing.

  • \(\Vert y^{(k+1)}-y^{(k)} \Vert=\Vert T(y^{(k)})-y^{(k)} \Vert\) is monotonically nonincreasing and converges to 0.

20 / 35

Convergence Property [2]

The following convergence result is on the \(x^{(k)}\) variables.

If \(h(x)\) is Lipschitz continuous on the closed ball \(B(0,(1+\alpha L)\Vert y^{(0)}-y^*\Vert)\), then

$$(f+g+h)(x^{(k)})-(f+g+h)(x^*)=o\left(\frac{1}{\sqrt{k+1}}\right).$$

This rate does not look much better than that of a subgradient method, but it turns out that by properly averaging the iterates \(x^{(k)}\) we can obtain a faster rate.

21 / 35

Convergence Property [2]

Let $$\bar{x}^{(k)}=\frac{2}{(k+1)(k+2)}\sum_{i=0}^k (i+1)x^{(i)},$$ and then

$$(f+g+h)(\bar{x}^{(k)})-(f+g+h)(x^*)=O\left(\frac{1}{k+1}\right).$$
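
A minimal sketch of computing this weighted (ergodic) average from stored iterates; the array layout (one iterate per row) is an assumption for illustration.

```python
import numpy as np

def ergodic_average(xs):
    """x_bar^{(k)} = 2/((k+1)(k+2)) * sum_{i=0}^{k} (i+1) x^{(i)}."""
    xs = np.asarray(xs, dtype=float)             # shape (k+1, d): iterates x^{(0)}, ..., x^{(k)}
    k = xs.shape[0] - 1
    weights = np.arange(1, k + 2, dtype=float)   # weight i + 1 for i = 0, ..., k
    return 2.0 * (weights[:, None] * xs).sum(axis=0) / ((k + 1) * (k + 2))
```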

22 / 35

Accelerations

  • There exist several accelerated variants of the DYS algorithm under stronger assumptions

  • See [2] for details

23 / 35

Proximal-Proximal-Gradient Algorithm

24 / 35

Further Extensions

  • DYS has given an elegant solution to the nonsmooth convex optimization problem \(\min_x\ f(x)+g(x)+h(x)\)

  • But what if we have more than three components?

  • For smooth components, this is easy:

  • The gradients are additive, so if we have smooth components \(f_1(x),\ldots,f_m(x)\), then just let \(f=f_1+\cdots+f_m\), and hence \(\nabla f=\nabla f_1+\cdots+\nabla f_m\)

  • Directly apply DYS as usual (of course, the smoothness parameter \(L\) may change, which affects the step size \(\alpha\))

25 / 35

The Consensus Trick

  • Proximal operators are in general not additive, but we can use the "consensus trick".

  • Suppose we want to minimize \(F(x)=f(x)+\sum_{i=1}^m h_i(x)\), where \(f(x)\) is smooth and \(h_i(x)\) may be nonsmooth. Then we find that

$$\require{color}\begin{align*} x^{*} & \in\underset{x}{\arg\min}\ f(x)+\sum_{i=1}^{m}h_{i}(x)\\ \Leftrightarrow & \ (x^{*},\ldots,x^{*})\in\underset{\substack{x_{(1)},\ldots,x_{(m)}\\ x_{(1)}=\cdots=x_{(m)} } }{\arg\min}\ f(\textcolor{deeppink}{\bar{x}})+\sum_{i=1}^{m}h_{i}(\textcolor{deeppink}{x_{(i)}}), \end{align*}$$ where \(\bar{x}=m^{-1}\sum_{i=1}^m x_{(i)}\).

26 / 35

The Consensus Trick

Therefore, if we want \(x^*\in\mathbb{R}^d\), then we can work on a "stacked" variable \(\mathbf{x}=(x_{(1)},\ldots,x_{(m)})\in\mathbb{R}^{md}\), and optimize the function \(\tilde{F}(\mathbf{x}):=f(\bar{\mathbf{x}})+I_{C}(\mathbf{x})+\tilde{h}(\mathbf{x})\), where

  • \(\bar{\mathbf{x}}=m^{-1}\sum_{i=1}^m x_{(i)}\)
  • \(C=\{\mathbf{x}:x_{(1)}=\cdots=x_{(m)}\}\)
  • \(\tilde{h}(\mathbf{x})=\sum_{i=1}^m h_i(x_{(i)})\)

We have shown that an optimal point of \(\tilde{F}(\mathbf{x})\) is \((x^{*},\ldots,x^{*})\), where \(x^*\) is an optimal point of the original problem.

27 / 35

Proximal Operators

  • More importantly, we can show that

  • \(\mathbf{prox}_{\alpha I_{C}}(\mathbf{x})=(\bar{\mathbf{x}},\ldots,\bar{\mathbf{x}})\)

  • \(\mathbf{prox}_{\alpha\tilde{h}}(\mathbf{x})=(\mathbf{prox}_{\alpha h_{1}}(x_{(1)}),\ldots,\mathbf{prox}_{\alpha h_{m}}(x_{(m)}))\)

  • This means that we only need to evaluate \(\require{color}\color{deeppink}\mathbf{prox}_{\alpha h_{i}}(\cdot)\) individually!

  • This is essentially the key idea of the proximal-proximal-gradient (PPG) algorithm
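
A minimal sketch of the two proximal operators on the stacked variable, storing \(\mathbf{x}\) as an \(m\times d\) array; prox_h_list (a list of callables for the individual \(\mathbf{prox}_{\alpha h_i}\)) is an illustrative assumption.

```python
import numpy as np

def prox_consensus(X):
    """prox_{alpha I_C}: replace every copy x_(i) by the average x_bar (alpha plays no role)."""
    return np.tile(X.mean(axis=0), (X.shape[0], 1))

def prox_separable(X, prox_h_list, alpha):
    """prox_{alpha h_tilde}: apply prox_{alpha h_i} to the i-th copy, row by row."""
    return np.vstack([prox_h_list[i](X[i], alpha) for i in range(X.shape[0])])
```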

28 / 35

PPG Algorithm

Consider the optimization problem

$$\min_{x\in\mathbb{R}^d} F(x):=\min_{x\in\mathbb{R}^d}\ r(x)+\frac{1}{n}\sum_{i=1}^{n}(f_{i}(x)+g_{i}(x))$$

  • \(r(x)\), \(f_i(x)\), and \(g_i(x)\) are convex functions

  • \(f_i(x)\) are differentiable

  • \(r(x)\) and \(g_i(x)\) have simple proximal operators

  • Generalization of DYS

29 / 35

PPG Algorithm

Given an initial value \(\mathbf{z}^{(0)}=(z_{(1)}^{(0)},\ldots,z_{(n)}^{(0)})\), iterate

$$\small\begin{align*} x^{(k+1/2)} & =\mathbf{prox}_{\alpha r}(\bar{\mathbf{z}}^{(k)})\\ x_{(i)}^{(k+1)} & =\mathbf{prox}_{\alpha g_{i}}\left(2x^{(k+1/2)}-z_{(i)}^{(k)}-\alpha\nabla f_{i}(x^{(k+1/2)})\right),\quad i=1,\ldots,n\\ z_{(i)}^{(k+1)} & =z_{(i)}^{(k)}+x_{(i)}^{(k+1)}-x^{(k+1/2)},\quad i=1,\ldots,n \end{align*}$$

Remarks:

  • \(\mathbf{z}^{(k)}=(z_{(1)}^{(k)},\ldots,z_{(n)}^{(k)})\in\mathbb{R}^{nd}\), subscripts for copies, superscripts for iteration numbers
  • \(\bar{\mathbf{z}}^{(k)}=n^{-1}\sum_{i=1}^{n}z_{(i)}^{(k)}\in\mathbb{R}^{d}\)
  • Updates of \(x_{(i)}^{(k+1)}\) and \(z_{(i)}^{(k+1)}\) can be parallelized across \(i\)
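
A minimal sketch of the PPG iteration, storing the copies \(z_{(1)},\ldots,z_{(n)}\) as rows of an \(n\times d\) array; prox_r, grad_f_list, and prox_g_list are illustrative names for the required oracles.

```python
import numpy as np

def ppg(z0, prox_r, grad_f_list, prox_g_list, alpha, n_iter=500):
    """Proximal-proximal-gradient for min_x r(x) + (1/n) sum_i (f_i(x) + g_i(x))."""
    Z = np.asarray(z0, dtype=float)               # rows are the copies z_(1), ..., z_(n)
    n = Z.shape[0]
    for _ in range(n_iter):
        x_half = prox_r(Z.mean(axis=0), alpha)    # x^{(k+1/2)} = prox_{alpha r}(z_bar)
        X = np.vstack([                           # this loop over i can be parallelized
            prox_g_list[i](2 * x_half - Z[i] - alpha * grad_f_list[i](x_half), alpha)
            for i in range(n)
        ])
        Z = Z + X - x_half                        # z_(i) update; x^{(k+1/2)} broadcast over rows
    return prox_r(Z.mean(axis=0), alpha)
```
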
30 / 35

Convergence Property [3]

We first state the convergence of the \(\mathbf{z}^{(k)}\) variables.

Suppose that \(f_1(x),\ldots,f_n(x)\) are differentiable and \(L\)-smooth. Select a step size \(0<\alpha<3/(2L)\), and denote \(p(\mathbf{z}^{(k)})=\alpha^{-1}(\mathbf{z}^{(k+1)}-\mathbf{z}^{(k)})\). Then \(\Vert p(\mathbf{z}^{(k)})\Vert\rightarrow 0\) monotonically with the rate $$\Vert p(\mathbf{z}^{(k)})\Vert= O(1/\sqrt{k}).$$

31 / 35

Convergence Property [3]

For the \(x\)-variables, we have \(x^{(k+1/2)}\rightarrow x^*\) and \(x_{(i)}^{(k)}\rightarrow x^*\) for all \(i=1,\ldots,n\), where \(x^*\) is an optimal point of \(F(x)\).

If, in addition, \(\bar{g}(x)=n^{-1}\sum_{i=1}^n g_i(x)\) is Lipschitz continuous, then

$$F(x^{(k+1/2)})-F(x^*)=O(1/\sqrt{k}).$$

Finally, let \(x_{avg}^{(k+1/2)}=k^{-1}\sum_{j=1}^k x^{(j+1/2)}\); then under the same assumptions we have

$$F(x_{avg}^{(k+1/2)})-F(x^*)=O(1/k).$$

32 / 35

Accelerations

Other faster convergence rates with stronger assumptions can be found in [3].

33 / 35

Summary

  • We have summarized three important algorithms for nonsmooth optimization problems

  • DRS \(\rightarrow\) DYS \(\rightarrow\) PPG with increasing generality

  • These methods are very useful for statistical models with multiple constraints and/or regularization terms

  • They typically converge much faster than subgradient methods

34 / 35

References

[1] Damek Davis and Wotao Yin (2016). Convergence rate analysis of several splitting schemes. Splitting methods in communication, imaging, science, and engineering.

[2] Damek Davis and Wotao Yin (2017). A three-operator splitting scheme and its optimization applications. Set-valued and variational analysis.

[3] Ernest K. Ryu and Wotao Yin (2019). Proximal-proximal-gradient method. Journal of Computational Mathematics.

[4] Patrick L. Combettes (2004). Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization.

35 / 35
