
Computational Statistics

Lecture 8

Yixuan Qiu

2022-11-02

1 / 35

Optimization

2 / 35

Last Time

  • Dealing with nonsmooth problems of the form F(x)=f(x)+h(x)

  • What if we have more than one nonsmooth term?

  • This is common in statistical models, e.g., by adding multiple regularization terms/constraints to the model

3 / 35

Today's Topics

  • Douglas-Rachford splitting method

  • Davis-Yin splitting method

  • Proximal-proximal-gradient algorithm

4 / 35

Motivation

  • For the nonsmooth function F(x)=f(x)+h(x), where f(x) is smooth and h(x) is nonsmooth, we have the proximal gradient descent algorithm

$$x^{(k+1)} = \mathrm{prox}_{\alpha h}\!\left(x^{(k)} - \alpha \nabla f(x^{(k)})\right), \quad k = 0, 1, \ldots$$

  • If we let f(x)=0, then we get the proximal point algorithm (PPA) for minimizing a nonsmooth function h(x):

$$x^{(k+1)} = \mathrm{prox}_{\alpha h}(x^{(k)}), \quad k = 0, 1, \ldots$$
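
To make these iterations concrete, here is a minimal NumPy sketch (not part of the original slides) of proximal gradient descent for a lasso-type problem with $f(x) = \tfrac{1}{2}\|Ax - b\|^2$ and $h(x) = \lambda\|x\|_1$, so that $\mathrm{prox}_{\alpha h}$ is the soft-thresholding operator; the data and the value of $\lambda$ below are invented for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t * ||.||_1: shrink each coordinate toward zero by t
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient(A, b, lam, alpha, n_iter=500):
    # minimize f(x) + h(x), with f(x) = 0.5 * ||Ax - b||^2 and h(x) = lam * ||x||_1
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                            # gradient of the smooth part
        x = soft_threshold(x - alpha * grad, alpha * lam)   # prox_{alpha * h} step
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = A @ (rng.standard_normal(20) * (rng.random(20) < 0.3)) + 0.1 * rng.standard_normal(50)
alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/L with L = ||A||_2^2
print(proximal_gradient(A, b, lam=0.5, alpha=alpha))
```

Dropping the gradient term inside the loop recovers the PPA update $x^{(k+1)} = \mathrm{prox}_{\alpha h}(x^{(k)})$.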

5 / 35

Motivation

  • Now consider F(x)=g(x)+h(x), where both g(x) and h(x) are nonsmooth

  • Of course, we can try to compute $\mathrm{prox}_{\alpha(g+h)}$, and then apply PPA

  • But unfortunately, in general $\mathrm{prox}_{\alpha(g+h)} \neq \mathrm{prox}_{\alpha g} + \mathrm{prox}_{\alpha h}$, nor any other simple combination of $\mathrm{prox}_{\alpha g}$ and $\mathrm{prox}_{\alpha h}$

  • So even if we can compute $\mathrm{prox}_{\alpha g}$ and $\mathrm{prox}_{\alpha h}$ individually, there is no obvious way to solve $\min_x F(x)$

6 / 35

Douglas-Rachford Splitting

7 / 35

Douglas-Rachford Splitting

The Douglas-Rachford splitting (DRS) algorithm is a useful method to solve the problem

$$\min_x\, F(x) := \min_x\, g(x) + h(x),$$

where g(x) and h(x) are convex functions, possibly nonsmooth.

The algorithm relies on $\mathrm{prox}_{\alpha g}$ and $\mathrm{prox}_{\alpha h}$.

8 / 35

DRS Algorithm

DRS uses the following iteration scheme: given an initial value $y^{(0)}$, for $k = 0, 1, \ldots$, iterate

$$\begin{aligned}
x^{(k+1)} &= \mathrm{prox}_{\alpha g}(y^{(k)}) \\
y^{(k+1)} &= y^{(k)} + \mathrm{prox}_{\alpha h}\!\left(2x^{(k+1)} - y^{(k)}\right) - x^{(k+1)}
\end{aligned}$$

  • The roles of g(x) and h(x) are not symmetric

  • The step size α>0 can be chosen arbitrarily, but it may affect the convergence speed
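
Below is a minimal NumPy sketch (not from the slides) of the DRS iteration above, written generically in terms of user-supplied proximal operators; the one-dimensional toy problem at the end, minimizing $|x| + \tfrac{1}{2}(x - u)^2$, is invented for illustration.

```python
import numpy as np

def drs(prox_g, prox_h, y0, alpha=1.0, n_iter=200):
    # Douglas-Rachford splitting for min_x g(x) + h(x).
    # prox_g(z, alpha) and prox_h(z, alpha) must return prox_{alpha*g}(z) and prox_{alpha*h}(z).
    y = np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        x = prox_g(y, alpha)                   # x^{(k+1)} = prox_{alpha g}(y^{(k)})
        y = y + prox_h(2 * x - y, alpha) - x   # y^{(k+1)} update
    return prox_g(y, alpha)                    # x* = prox_{alpha g}(y*)

# toy problem: minimize |x| + 0.5 * (x - u)^2, whose solution is soft-thresholding of u at 1
u = 3.0
prox_g = lambda z, a: np.sign(z) * np.maximum(np.abs(z) - a, 0.0)   # prox of alpha * |x|
prox_h = lambda z, a: (z + a * u) / (1.0 + a)                       # prox of alpha * 0.5 * (x - u)^2
print(drs(prox_g, prox_h, y0=0.0, alpha=1.0))                       # approximately 2.0
```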

9 / 35

Convergence Property [1,4]

We first present the convergence property of the $y^{(k)}$ sequence. Define $T(y) = 2\,\mathrm{prox}_{\alpha h}\!\left(2\,\mathrm{prox}_{\alpha g}(y) - y\right) - 2\,\mathrm{prox}_{\alpha g}(y) + y$, and then it is easy to see that $y^{(k+1)} = \left(y^{(k)} + T(y^{(k)})\right)/2$.

  • $y^{(k)}$ converges to some fixed point $y^*$ of $T(\cdot)$, i.e., $T(y^*) = y^*$.

  • $\|y^{(k)} - y^*\|$ is monotonically nonincreasing.

  • $\|y^{(k+1)} - y^{(k)}\| = \|T(y^{(k)}) - y^{(k)}\|/2$ is monotonically nonincreasing and converges to 0.

10 / 35

Convergence Property [1,4]

(continued from last slide)

  • We have the asymptotic rate $\|y^{(k+1)} - y^{(k)}\|^2 = o(1/k)$.

  • Nonasymptotic rate: $\|y^{(k+1)} - y^{(k)}\|^2 \le \dfrac{\|y^{(0)} - y^*\|^2}{k+1}$.

11 / 35

Convergence Property [1,4]

$y^*$ is connected with the optimization problem $\min_x\, g(x) + h(x)$ via the following important conclusion:

If $y^*$ is a point such that $T(y^*) = y^*$, then $x^* = \mathrm{prox}_{\alpha g}(y^*)$ is an optimal point of $\min_x\, g(x) + h(x)$.

It has also been proved that $x^{(k)}$ converges to some optimal point of $\min_x\, g(x) + h(x)$.

Convergence rates will be introduced in the Davis-Yin splitting algorithm, which is a generalization of DRS.

12 / 35

Example - $P_{C \cap D}(u)$

Problem: given two closed convex sets $C$ and $D$ with $C \cap D \neq \emptyset$, compute the projection operator $P_{C \cap D}(u)$.

In many cases we have simple $P_C$ and $P_D$ operations, but $C \cap D$ may be complicated. For example:

13 / 35

Example - $P_{C \cap D}(u)$

The optimization problem becomes

$$\min_x\ \tfrac{1}{2}\|x - u\|^2 \quad \text{s.t.}\ x \in C,\ x \in D.$$

Or equivalently,

$$\min_x\ \tfrac{1}{2}\|x - u\|^2 + I_C(x) + I_D(x),$$ where $I_C(x) = 0$ if $x \in C$, and $I_C(x) = +\infty$ if $x \notin C$.

14 / 35

Example - $P_{C \cap D}(u)$

Now let $g(x) = \tfrac{1}{2}\|x - u\|^2 + I_C(x)$, and then

$$\begin{aligned}
\mathrm{prox}_{\alpha g}(z) &= \arg\min_x\ \tfrac{1}{2}\|x - u\|^2 + I_C(x) + \tfrac{1}{2\alpha}\|x - z\|^2 \\
&= \arg\min_x\ \frac{(\alpha+1)\|x\|^2 - 2x^\top(\alpha u + z)}{2\alpha} + I_C(x) \\
&= \arg\min_{x \in C}\ \left\|x - (\alpha+1)^{-1}(\alpha u + z)\right\|^2 \\
&= P_C\!\left((\alpha+1)^{-1}(\alpha u + z)\right).
\end{aligned}$$

Also, let $h(x) = I_D(x)$, so $\mathrm{prox}_{\alpha h}(z) = P_D(z)$. Then proceed using the DRS algorithm.
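
Putting the pieces together, here is a short NumPy sketch (not from the slides) of DRS for this projection example; the sets $C$ (nonnegative orthant) and $D$ (unit Euclidean ball) and the point $u$ are chosen arbitrarily for illustration.

```python
import numpy as np

u = np.array([1.5, -0.5])
alpha = 1.0

def P_C(z):
    # projection onto the nonnegative orthant C = {x : x >= 0}
    return np.maximum(z, 0.0)

def P_D(z):
    # projection onto the unit ball D = {x : ||x|| <= 1}
    nrm = np.linalg.norm(z)
    return z if nrm <= 1.0 else z / nrm

y = np.zeros_like(u)
for _ in range(200):
    # prox_{alpha g}(y) with g(x) = 0.5 * ||x - u||^2 + I_C(x), as derived above
    x = P_C((alpha * u + y) / (alpha + 1.0))
    # prox_{alpha h} = P_D since h = I_D
    y = y + P_D(2.0 * x - y) - x

print(P_C((alpha * u + y) / (alpha + 1.0)))   # approximate projection of u onto C ∩ D
```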

15 / 35

Davis-Yin Splitting

16 / 35

Motivation

  • Suppose that f(x) is a smooth convex function, and g(x) and h(x) are possibly nonsmooth convex functions

  • Recall that proximal gradient descent minimizes f(x)+h(x)

  • DRS algorithm minimizes g(x)+h(x)

  • To unify the above two, we want to find an algorithm to minimize F(x)=f(x)+g(x)+h(x)

  • Later we will also see that such a "three-operator" problem is the key to handling the sum of an arbitrary number of functions

17 / 35

Davis-Yin Splitting

Consider the optimization problem

$$\min_x\, F(x) := \min_x\, f(x) + g(x) + h(x),$$

where f(x) is convex and L-smooth, and g(x) and h(x) are possibly nonsmooth convex functions.

The Davis-Yin splitting (DYS) algorithm uses the following iteration scheme: given an initial value $y^{(0)}$ and a step size $0 < \alpha < 2/L$, for $k = 0, 1, \ldots$, iterate

$$\begin{aligned}
x^{(k+1)} &= \mathrm{prox}_{\alpha g}(y^{(k)}) \\
y^{(k+1)} &= y^{(k)} + \mathrm{prox}_{\alpha h}\!\left(2x^{(k+1)} - y^{(k)} - \alpha \nabla f(x^{(k+1)})\right) - x^{(k+1)}
\end{aligned}$$
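
A minimal NumPy sketch of this iteration (not from the slides), with $\nabla f$ and the two proximal operators supplied as callables; the toy problem at the end, a nonnegativity-constrained $\ell_1$-regularized quadratic, is invented for illustration.

```python
import numpy as np

def dys(grad_f, prox_g, prox_h, y0, alpha, n_iter=500):
    # Davis-Yin splitting for min_x f(x) + g(x) + h(x).
    # grad_f(x) is the gradient of the smooth part; prox_g/prox_h take (z, alpha).
    y = np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        x = prox_g(y, alpha)                                       # x^{(k+1)}
        y = y + prox_h(2 * x - y - alpha * grad_f(x), alpha) - x   # y^{(k+1)}
    return prox_g(y, alpha)

# toy problem: f(x) = 0.5 * ||x - u||^2, g = indicator of {x >= 0}, h(x) = lam * ||x||_1
u = np.array([2.0, -1.0, 0.3])
lam = 0.5
grad_f = lambda x: x - u
prox_g = lambda z, a: np.maximum(z, 0.0)
prox_h = lambda z, a: np.sign(z) * np.maximum(np.abs(z) - a * lam, 0.0)
print(dys(grad_f, prox_g, prox_h, y0=np.zeros(3), alpha=1.0))   # approx [1.5, 0.0, 0.0]
```

Here $f$ is 1-smooth, so the step size $\alpha = 1$ satisfies $0 < \alpha < 2/L$.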

18 / 35

Some Remarks

  • For the smooth component f(x), we compute its gradient $\nabla f(x)$

  • For the nonsmooth terms g(x) and h(x), we use their proximal operators

  • The algorithm is similar to DRS, with an additional gradient descent term

19 / 35

Convergence Property [2]

Define $T(y) = \mathrm{prox}_{\alpha h}\!\left(2\,\mathrm{prox}_{\alpha g}(y) - y - \alpha \nabla f(\mathrm{prox}_{\alpha g}(y))\right) - \mathrm{prox}_{\alpha g}(y) + y$, and then $y^{(k+1)} = T(y^{(k)})$.

Similar to DRS, the following properties of the y(k) sequence hold:

  • $y^{(k)}$ converges to some fixed point $y^*$ of $T(\cdot)$.

  • $\|y^{(k)} - y^*\|$ is monotonically nonincreasing.

  • $\|y^{(k+1)} - y^{(k)}\| = \|T(y^{(k)}) - y^{(k)}\|$ is monotonically nonincreasing and converges to 0.

20 / 35

Convergence Property [2]

The following convergence result is on the $x^{(k)}$ variables.

Suppose that $h(x)$ is Lipschitz continuous on the closed ball $B\!\left(0, (1+\alpha L)\,\|y^{(0)} - y^*\|\right)$; then

$$(f+g+h)(x^{(k)}) - (f+g+h)(x^*) = o\!\left(\frac{1}{\sqrt{k+1}}\right).$$

It seems that this convergence rate is not significantly better than that of a subgradient method, but by properly averaging the iterates $x^{(k)}$ we can obtain a faster rate.

21 / 35

Convergence Property [2]

Let $\bar{x}^{(k)} = \frac{2}{(k+1)(k+2)} \sum_{i=0}^{k} (i+1)\, x^{(i)}$, and then

$$(f+g+h)(\bar{x}^{(k)}) - (f+g+h)(x^*) = O\!\left(\frac{1}{k+1}\right).$$
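
For concreteness, here is a small NumPy helper (not from the slides) that forms this weighted average from a stored list of iterates; equivalently, it can be maintained on the fly via $\bar{x}^{(k)} = \left(k\,\bar{x}^{(k-1)} + 2x^{(k)}\right)/(k+2)$.

```python
import numpy as np

def weighted_average(xs):
    # x_bar^{(k)} = 2 / ((k+1)(k+2)) * sum_{i=0}^{k} (i+1) * x^{(i)}
    k = len(xs) - 1
    w = np.arange(1, k + 2, dtype=float)   # weights 1, 2, ..., k+1
    return (w[:, None] * np.stack(xs)).sum(axis=0) * 2.0 / ((k + 1) * (k + 2))

xs = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([4.0, 4.0])]
print(weighted_average(xs))   # (1*[0,0] + 2*[1,1] + 3*[4,4]) / 6 = [7/3, 7/3]
```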

22 / 35

Accelerations

  • There exist several accelerated variants of the DYS algorithm under stronger assumptions

  • See [2] for details

23 / 35

Proximal-Proximal-Gradient Algorithm

24 / 35

Further Extensions

  • DYS has given an elegant solution to the nonsmooth convex optimization problem $\min_x\, f(x) + g(x) + h(x)$

  • But what if we have more than three components?

  • For smooth components, this is easy:

  • The gradients are additive, so if we have smooth components $f_1(x), \ldots, f_m(x)$, then just let $f = f_1 + \cdots + f_m$, and hence $\nabla f = \nabla f_1 + \cdots + \nabla f_m$

  • Directly apply DYS as usual (of course, the smoothness parameter L may change, which affects the step size α)

25 / 35

The Consensus Trick

  • Proximal operators are in general not additive, but we can use the "consensus trick".

  • Suppose we want to minimize $F(x) = f(x) + \sum_{i=1}^m h_i(x)$, where $f(x)$ is smooth and $h_i(x)$ may be nonsmooth. Then we find that

$$x^* \in \arg\min_x\ f(x) + \sum_{i=1}^m h_i(x) \;\Longleftrightarrow\; (x^*, \ldots, x^*) \in \arg\min_{\substack{x_{(1)}, \ldots, x_{(m)} \\ x_{(1)} = \cdots = x_{(m)}}}\ f(\bar{x}) + \sum_{i=1}^m h_i(x_{(i)}),$$ where $\bar{x} = m^{-1} \sum_{i=1}^m x_{(i)}$.

26 / 35

The Consensus Trick

Therefore, if we want $x \in \mathbb{R}^d$, then we can work on a "stacked" variable $x = (x_{(1)}, \ldots, x_{(m)}) \in \mathbb{R}^{md}$, and optimize the function $\tilde{F}(x) := f(\bar{x}) + I_C(x) + \tilde{h}(x)$, where

  • $\bar{x} = m^{-1} \sum_{i=1}^m x_{(i)}$
  • $C = \{x : x_{(1)} = \cdots = x_{(m)}\}$
  • $\tilde{h}(x) = \sum_{i=1}^m h_i(x_{(i)})$

We have shown that an optimal point of $\tilde{F}(x)$ is $(x^*, \ldots, x^*)$, where $x^*$ is an optimal point of the original problem.

27 / 35

Proximal Operators

  • More importantly, we can show that

  • $\mathrm{prox}_{\alpha I_C}(x) = (\bar{x}, \ldots, \bar{x})$

  • $\mathrm{prox}_{\alpha \tilde{h}}(x) = \left(\mathrm{prox}_{\alpha h_1}(x_{(1)}), \ldots, \mathrm{prox}_{\alpha h_m}(x_{(m)})\right)$

  • This means that we only need to evaluate each $\mathrm{prox}_{\alpha h_i}(\cdot)$ individually (as sketched below)!

  • This is essentially the key idea of the proximal-proximal-gradient (PPG) algorithm
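
Here is a small NumPy sketch (not from the slides) of these two proximal operators, with the stacked variable stored as an $m \times d$ array; the soft-thresholding choice for the $h_i$ in the check at the end is arbitrary.

```python
import numpy as np

def prox_consensus(x, alpha=1.0):
    # prox of alpha * I_C with C = {x_(1) = ... = x_(m)}: replace every copy by the
    # average x_bar (the step size plays no role for an indicator function)
    x_bar = x.mean(axis=0)
    return np.tile(x_bar, (x.shape[0], 1))

def prox_separable(x, prox_list, alpha):
    # prox of h_tilde(x) = sum_i h_i(x_(i)): apply each prox_{alpha h_i} to its own copy
    return np.stack([prox_list[i](x[i], alpha) for i in range(x.shape[0])])

# tiny check with m = 3 copies in R^2 and every h_i equal to the l1 norm
soft = lambda z, a: np.sign(z) * np.maximum(np.abs(z) - a, 0.0)
x = np.arange(6, dtype=float).reshape(3, 2)
print(prox_consensus(x))
print(prox_separable(x, [soft, soft, soft], alpha=1.0))
```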

28 / 35

PPG Algorithm

Consider the optimization problem

$$\min_{x \in \mathbb{R}^d} F(x) := \min_{x \in \mathbb{R}^d}\ r(x) + \frac{1}{n} \sum_{i=1}^n \left(f_i(x) + g_i(x)\right)$$

  • $r(x)$, $f_i(x)$, and $g_i(x)$ are convex functions

  • $f_i(x)$ are differentiable

  • $r(x)$ and $g_i(x)$ have simple proximal operators

  • Generalization of DYS

29 / 35

PPG Algorithm

Given an initial value $z^{(0)} = (z_{(1)}^{(0)}, \ldots, z_{(n)}^{(0)})$, iterate

$$\begin{aligned}
x^{(k+1/2)} &= \mathrm{prox}_{\alpha r}(\bar{z}^{(k)}) \\
x_{(i)}^{(k+1)} &= \mathrm{prox}_{\alpha g_i}\!\left(2x^{(k+1/2)} - z_{(i)}^{(k)} - \alpha \nabla f_i(x^{(k+1/2)})\right), \quad i = 1, \ldots, n \\
z_{(i)}^{(k+1)} &= z_{(i)}^{(k)} + x_{(i)}^{(k+1)} - x^{(k+1/2)}, \quad i = 1, \ldots, n
\end{aligned}$$

Remarks:

  • $z^{(k)} = (z_{(1)}^{(k)}, \ldots, z_{(n)}^{(k)}) \in \mathbb{R}^{nd}$; subscripts index the copies, superscripts the iteration numbers
  • $\bar{z}^{(k)} = n^{-1} \sum_{i=1}^n z_{(i)}^{(k)} \in \mathbb{R}^d$
  • Updates of $x_{(i)}^{(k+1)}$ and $z_{(i)}^{(k+1)}$ can be parallelized across $i$ (see the sketch below)
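
Below is a minimal NumPy sketch (not from the slides) of the PPG iteration above, with the proximal operators and gradients supplied as callables; the small distributed-lasso demo at the end (the data, $\lambda$, and the choice $g_i \equiv 0$) is invented for illustration.

```python
import numpy as np

def ppg(prox_r, grads, proxes_g, z0, alpha, n_iter=500):
    # PPG for min_x r(x) + (1/n) * sum_i (f_i(x) + g_i(x)).
    # grads[i](x) is the gradient of f_i; proxes_g[i](z, alpha) returns prox_{alpha*g_i}(z).
    # z0 has shape (n, d): one copy z_(i) per summand.
    z = np.asarray(z0, dtype=float)
    n = z.shape[0]
    for _ in range(n_iter):
        x_half = prox_r(z.mean(axis=0), alpha)   # x^{(k+1/2)} = prox_{alpha r}(z_bar)
        x_new = np.stack([                       # x_(i)^{(k+1)}: can run in parallel over i
            proxes_g[i](2 * x_half - z[i] - alpha * grads[i](x_half), alpha)
            for i in range(n)
        ])
        z = z + x_new - x_half                   # z_(i)^{(k+1)} update
    return prox_r(z.mean(axis=0), alpha)

# tiny demo: r(x) = lam * ||x||_1, f_i(x) = 0.5 * ||A_i x - b_i||^2, g_i = 0
rng = np.random.default_rng(1)
n, d = 4, 5
A = [rng.standard_normal((10, d)) for _ in range(n)]
x_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0])
b = [A[i] @ x_true + 0.1 * rng.standard_normal(10) for i in range(n)]
lam = 0.1
prox_r = lambda z, a: np.sign(z) * np.maximum(np.abs(z) - a * lam, 0.0)
grads = [lambda x, i=i: A[i].T @ (A[i] @ x - b[i]) for i in range(n)]
proxes_g = [lambda z, a: z] * n                         # g_i = 0, so its prox is the identity
L = max(np.linalg.norm(A[i], 2) ** 2 for i in range(n))
print(ppg(prox_r, grads, proxes_g, np.zeros((n, d)), alpha=1.0 / L))
```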
30 / 35

Convergence Property [3]

We first state the convergence of the z(k) variables.

Suppose that $f_1(x), \ldots, f_n(x)$ are differentiable and $L$-smooth. Select a step size $0 < \alpha < 3/(2L)$, and denote $p(z^{(k)}) = \alpha^{-1}(z^{(k+1)} - z^{(k)})$. Then $\|p(z^{(k)})\| \to 0$ monotonically, with the rate $\|p(z^{(k)})\| = O(1/\sqrt{k})$.

31 / 35

Convergence Property [3]

For the $x$-variables, we have $x^{(k+1/2)} \to x^*$ and $x_{(i)}^{(k)} \to x^*$ for all $i = 1, \ldots, n$, where $x^*$ is an optimal point of $F(x)$.

If, in addition, $\bar{g}(x) = n^{-1} \sum_{i=1}^n g_i(x)$ is Lipschitz continuous, then

$$F(x^{(k+1/2)}) - F(x^*) = O(1/\sqrt{k}).$$

Finally, let $x_{\mathrm{avg}}^{(k+1/2)} = k^{-1} \sum_{j=1}^{k} x^{(j+1/2)}$; then under the same assumptions we have

$$F(x_{\mathrm{avg}}^{(k+1/2)}) - F(x^*) = O(1/k).$$

32 / 35

Accelerations

Faster convergence rates under stronger assumptions can be found in [3].

33 / 35

Summary

  • We have summarized three important algorithms for nonsmooth optimization problems

  • DRS → DYS → PPG, with increasing generality

  • These methods are very useful for statistical models with multiple constraints and/or regularization terms

  • They typically have much better convergence speed than subgradient methods

34 / 35

References

[1] Damek Davis and Wotao Yin (2016). Convergence rate analysis of several splitting schemes. Splitting methods in communication, imaging, science, and engineering.

[2] Damek Davis and Wotao Yin (2017). A three-operator splitting scheme and its optimization applications. Set-valued and variational analysis.

[3] Ernest K. Ryu and Wotao Yin (2019). Proximal-proximal-gradient method. Journal of Computational Mathematics.

[4] Patrick L. Combettes (2004). Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization.

35 / 35
