Dealing with nonsmooth problems of the form F(x)=f(x)+h(x)
What if we have more than one nonsmooth term?
This is common in statistical models, e.g., by adding multiple regularization terms/constraints to the model
Douglas-Rachford splitting method
Davis-Yin splitting method
Proximal-proximal-gradient algorithm
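Proximal gradient descent for $F(x)=f(x)+h(x)$: $x^{(k+1)}=\mathrm{prox}_{\alpha h}(x^{(k)}-\alpha\nabla f(x^{(k)})),\ k=0,1,\ldots$
With no smooth term ($f\equiv 0$) this reduces to the proximal point algorithm: $x^{(k+1)}=\mathrm{prox}_{\alpha h}(x^{(k)}),\ k=0,1,\ldots$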
Now consider F(x)=g(x)+h(x), where both g(x) and h(x) are nonsmooth
Of course, we can try to compute proxα(g+h) and then apply the proximal point algorithm (PPA)
Unfortunately, in general proxα(g+h)≠proxαg+proxαh, nor is it any other simple combination of proxαg and proxαh
So even if we can compute proxαg and proxαh individually, there is no obvious way to solve minxF(x)
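A minimal numerical illustration of the non-additivity, assuming the toy choice g(x)=h(x)=|x| in one dimension (this example is not from the slides). The helper `soft_threshold` is an illustrative name for the well-known proximal operator of the scaled absolute value.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |x| (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

alpha = 1.0
z = 3.0

# With g(x) = h(x) = |x| we have (g + h)(x) = 2|x|,
# so prox_{alpha (g+h)} soft-thresholds at level 2 * alpha.
prox_sum = soft_threshold(z, 2 * alpha)

# Adding the two individual proximal operators gives something different.
sum_of_prox = soft_threshold(z, alpha) + soft_threshold(z, alpha)

print(prox_sum, sum_of_prox)
```

Here the two quantities are 1 and 4, so the sum of the individual proximal operators is not the proximal operator of the sum.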
The Douglas-Rachford splitting (DRS) algorithm is a useful method to solve the problem
minx F(x):=minx g(x)+h(x),
where g(x) and h(x) are convex functions, possibly nonsmooth.
The algorithm relies on proxαg and proxαh.
DRS uses the following iteration scheme: given an initial value y(0), for k=0,1,…, iterate
$$x^{(k+1)}=\mathrm{prox}_{\alpha g}(y^{(k)})$$
$$y^{(k+1)}=y^{(k)}+\mathrm{prox}_{\alpha h}(2x^{(k+1)}-y^{(k)})-x^{(k+1)}$$
The roles of g(x) and h(x) are not symmetric
The step size α>0 can be chosen arbitrarily, but it may affect the convergence speed
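The DRS iteration can be sketched in a few lines. The function name `drs`, the calling convention `prox(z, alpha)`, and the test problem g(x)=½‖x−a‖², h(x)=λ‖x‖₁ are assumptions made for illustration. That problem is chosen only because its solution is known in closed form (soft-thresholding of a), which makes the result easy to check.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||x||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def drs(prox_g, prox_h, y0, alpha, n_iter=500):
    """Douglas-Rachford splitting for min_x g(x) + h(x).

    prox_g and prox_h are callables (z, alpha) -> prox_{alpha g}(z), prox_{alpha h}(z).
    Returns x = prox_{alpha g}(y) evaluated at the final y iterate.
    """
    y = np.asarray(y0, dtype=float).copy()
    for _ in range(n_iter):
        x = prox_g(y, alpha)                    # x^(k+1)
        y = y + prox_h(2 * x - y, alpha) - x    # y^(k+1)
    return prox_g(y, alpha)

# Illustrative problem: g(x) = 0.5 * ||x - a||^2, h(x) = lam * ||x||_1.
a = np.array([3.0, -0.5, 1.2])
lam = 1.0
prox_g = lambda z, alpha: (z + alpha * a) / (1 + alpha)
prox_h = lambda z, alpha: soft_threshold(z, alpha * lam)

x_hat = drs(prox_g, prox_h, np.zeros(3), alpha=1.0)
x_closed = soft_threshold(a, lam)   # known minimizer of this particular problem
print(x_hat, x_closed)
```

Swapping the roles of `prox_g` and `prox_h` gives a different y sequence, which is the asymmetry noted above.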
We first present the convergence property of the y(k) sequence. Define T(y)=2proxαh(2proxαg(y)−y)−2proxαg(y)+y, and then it is easy to see that y(k+1)=(y(k)+T(y(k)))/2.
y(k) converges to some fixed point y∗ of T(⋅), i.e., T(y∗)=y∗.
∥y(k)−y∗∥ is monotonically nonincreasing.
∥y(k+1)−y(k)∥=∥T(y(k))−y(k)∥/2 is monotonically nonincreasing and converges to 0.
We have the asymptotic rate $\|y^{(k+1)}-y^{(k)}\|^2=o(1/k)$.
Nonasymptotic rate: $\|y^{(k+1)}-y^{(k)}\|^2\le \dfrac{\|y^{(0)}-y^*\|^2}{k+1}$.
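These properties of the y sequence can be checked numerically. The sketch below assumes the same illustrative problem g(x)=½‖x−a‖², h(x)=λ‖x‖₁. For this problem a fixed point is available in closed form, y∗=(1+α)x∗−αa with x∗ the soft-thresholded a, so the bound can be evaluated exactly. All names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

a = np.array([3.0, -0.5, 1.2])
lam, alpha = 1.0, 0.7
prox_g = lambda z: (z + alpha * a) / (1 + alpha)     # g(x) = 0.5 * ||x - a||^2
prox_h = lambda z: soft_threshold(z, alpha * lam)    # h(x) = lam * ||x||_1

x_star = soft_threshold(a, lam)                      # minimizer of g + h
y_star = (1 + alpha) * x_star - alpha * a            # satisfies prox_g(y_star) = x_star

y = np.zeros(3)
y0_dist2 = np.sum((y - y_star) ** 2)
residuals, bound_ok = [], []
for k in range(50):
    x = prox_g(y)
    y_next = y + prox_h(2 * x - y) - x
    res2 = np.sum((y_next - y) ** 2)
    residuals.append(np.sqrt(res2))
    # Nonasymptotic bound: ||y^(k+1) - y^(k)||^2 <= ||y^(0) - y*||^2 / (k + 1)
    bound_ok.append(res2 <= y0_dist2 / (k + 1) + 1e-12)
    y = y_next

# The residual norm should never increase (small slack for rounding).
monotone = all(r1 <= r0 + 1e-12 for r0, r1 in zip(residuals, residuals[1:]))
print(monotone, all(bound_ok))
```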
y∗ is connected with the optimization problem minx g(x)+h(x) via the following important conclusion:
If y∗ is a point such that T(y∗)=y∗, then x∗=proxαg(y∗) is an optimal point of minx g(x)+h(x).
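A short derivation of this conclusion, written out as a sketch. It uses only the standard characterization of the proximal operator, $p=\mathrm{prox}_{\alpha\phi}(z)\iff (z-p)/\alpha\in\partial\phi(p)$, and the definition of $T$ above.

```latex
\begin{aligned}
x^* = \mathrm{prox}_{\alpha g}(y^*)
  &\;\Longrightarrow\; \tfrac{1}{\alpha}(y^* - x^*) \in \partial g(x^*), \\
T(y^*) = y^*
  &\;\Longrightarrow\; \mathrm{prox}_{\alpha h}(2x^* - y^*) = x^*
  \;\Longrightarrow\; \tfrac{1}{\alpha}(x^* - y^*) \in \partial h(x^*), \\
\text{adding the two inclusions}
  &\;\Longrightarrow\; 0 \in \partial g(x^*) + \partial h(x^*) \subseteq \partial (g+h)(x^*).
\end{aligned}
```

Since $g+h$ is convex, $0\in\partial(g+h)(x^*)$ means that $x^*$ is a minimizer.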
It has also been proved that x(k) converges to some optimal point of minx g(x)+h(x).
Convergence rates will be introduced in the Davis-Yin splitting algorithm, which is a generalization of DRS.
Problem: given two closed convex sets $C$ and $D$ with $C\cap D\neq\emptyset$, compute the projection $P_{C\cap D}(u)$ of a point $u$ onto $C\cap D$.
In many cases we have simple PC and PD operations, but C∩D may be complicated. For example:

The optimization problem becomes
$$\min_x\ \tfrac{1}{2}\|x-u\|^2\quad \text{s.t. } x\in C,\ x\in D.$$
Or equivalently,
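$$\min_x\ \tfrac{1}{2}\|x-u\|^2+I_C(x)+I_D(x),$$ where $I_C(x)=0$ if $x\in C$, and $I_C(x)=\infty$ if $x\notin C$.
Now let $g(x)=\tfrac{1}{2}\|x-u\|^2+I_C(x)$, and then
$$\begin{aligned}\mathrm{prox}_{\alpha g}(z)&=\arg\min_x\ \tfrac{1}{2}\|x-u\|^2+I_C(x)+\tfrac{1}{2\alpha}\|x-z\|^2\\&=\arg\min_x\ \frac{(\alpha+1)\|x\|^2-2x'(\alpha u+z)}{2\alpha}+I_C(x)\\&=\arg\min_{x\in C}\ \|x-(\alpha+1)^{-1}(\alpha u+z)\|^2\\&=P_C\big((\alpha+1)^{-1}(\alpha u+z)\big).\end{aligned}$$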
Also, let h(x)=ID(x), and proxαh(z)=PD(z). Then proceed using the DRS algorithm.
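A minimal sketch of this recipe, assuming the illustrative choice C = {x : x ≥ 0} and D = {x : ∑ᵢ xᵢ = 1}, so that C∩D is the probability simplex. Both individual projections are one-liners. The function name `project_intersection` and the test point u are hypothetical.

```python
import numpy as np

def project_intersection(u, proj_C, proj_D, alpha=1.0, n_iter=2000):
    """Approximate P_{C ∩ D}(u) by DRS with
    g(x) = 0.5 * ||x - u||^2 + I_C(x) and h(x) = I_D(x)."""
    u = np.asarray(u, dtype=float)
    prox_g = lambda z: proj_C((alpha * u + z) / (alpha + 1))  # formula derived above
    y = u.copy()
    for _ in range(n_iter):
        x = prox_g(y)
        y = y + proj_D(2 * x - y) - x      # prox_{alpha h} = P_D
    return prox_g(y)

# Assumed example sets: nonnegative orthant and the hyperplane sum(x) = 1.
proj_C = lambda z: np.maximum(z, 0.0)
proj_D = lambda z: z - (np.sum(z) - 1.0) / z.size

u = np.array([0.5, 0.2, -0.4])
p = project_intersection(u, proj_C, proj_D)
print(p)
```

For this u, the Euclidean projection onto the simplex worked out by hand is (0.65, 0.35, 0), which the iterates approach.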
Suppose that f(x) is a smooth convex function, and g(x) and h(x) are possibly nonsmooth convex functions
Recall that proximal gradient descent minimizes f(x)+h(x)
DRS algorithm minimizes g(x)+h(x)
To unify the above two, we want to find an algorithm to minimize F(x)=f(x)+g(x)+h(x)
Later we will also see that such a "three-operator" problem is the key to handling the sum of an arbitrary number of functions
Consider the optimization problem
minx F(x):=minx f(x)+g(x)+h(x),
where f(x) is convex and L-smooth, and g(x) and h(x) are possibly nonsmooth convex functions.
The Davis-Yin splitting (DYS) algorithm uses the following iteration scheme: given an initial value y(0) and a step size 0<α<2/L, for k=0,1,…, iterate
$$x^{(k+1)}=\mathrm{prox}_{\alpha g}(y^{(k)})$$
$$y^{(k+1)}=y^{(k)}+\mathrm{prox}_{\alpha h}\big(2x^{(k+1)}-y^{(k)}-\alpha\nabla f(x^{(k+1)})\big)-x^{(k+1)}$$
For the smooth component f(x), we compute its gradient ∇f(x)
For the nonsmooth terms g(x) and h(x), we use their proximal operators
The algorithm is similar to DRS, with an additional gradient descent term
Define T(y)=proxαh(2proxαg(y)−y−α∇f(proxαg(y)))−proxαg(y)+y, and then y(k+1)=T(y(k)).
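The DYS iteration can be sketched as follows, with the gradient evaluated at proxαg(y) as in the definition of T. The function name `dys` and the test problem are assumptions for illustration: f(x)=½‖x−a‖² (so L=1), g the indicator of the box [lo, hi]ᵈ, and h(x)=λ‖x‖₁. This problem is separable, and when lo ≤ 0 ≤ hi its solution is the soft-thresholded a clipped to the box, which gives a reference value.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def dys(grad_f, prox_g, prox_h, y0, alpha, n_iter=1000):
    """Davis-Yin splitting for min_x f(x) + g(x) + h(x).

    grad_f: x -> gradient of the smooth term.
    prox_g, prox_h: (z, alpha) -> proximal operators of the nonsmooth terms.
    """
    y = np.asarray(y0, dtype=float).copy()
    for _ in range(n_iter):
        x = prox_g(y, alpha)
        y = y + prox_h(2 * x - y - alpha * grad_f(x), alpha) - x
    return prox_g(y, alpha)

# Illustrative separable problem with a known solution.
a = np.array([2.5, -0.3, 0.9, -4.0])
lam, lo, hi = 0.5, -1.0, 1.0
grad_f = lambda x: x - a                                  # f is 1-smooth
prox_g = lambda z, alpha: np.clip(z, lo, hi)              # projection onto the box
prox_h = lambda z, alpha: soft_threshold(z, alpha * lam)

x_hat = dys(grad_f, prox_g, prox_h, np.zeros(4), alpha=0.5)   # 0 < alpha < 2 / L
x_closed = np.clip(soft_threshold(a, lam), lo, hi)
print(x_hat, x_closed)
```

Setting `grad_f` to the zero map recovers the DRS iteration.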
Similar to DRS, the following properties of the y(k) sequence hold:
y(k) converges to some fixed point y∗ of T(⋅).
∥y(k)−y∗∥ is monotonically nonincreasing.
∥y(k+1)−y(k)∥=∥T(y(k))−y(k)∥ is monotonically nonincreasing and converges to 0.
The following convergence result is on the x(k) variables.
Suppose h(x) is Lipschitz continuous on the closed ball B(0,(1+αL)∥y(0)−y∗∥), then
$$(f+g+h)(x^{(k)})-(f+g+h)(x^*)=o\!\left(\frac{1}{\sqrt{k+1}}\right).$$
This rate does not look significantly better than that of a subgradient method, but it can be shown that properly averaging the iterates x(k) gives a faster rate.
Let $\bar{x}^{(k)}=\dfrac{2}{(k+1)(k+2)}\sum_{i=0}^{k}(i+1)\,x^{(i)}$, and then
$$(f+g+h)(\bar{x}^{(k)})-(f+g+h)(x^*)=O\!\left(\frac{1}{k+1}\right).$$
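The weighted average does not require storing all iterates. From its definition one obtains the running update x̄(k) = (k/(k+2))·x̄(k−1) + (2/(k+2))·x(k). The sketch below checks that update against the direct formula on an arbitrary sequence. The function names and the random test data are illustrative only.

```python
import numpy as np

def weighted_average_direct(xs):
    """2 / ((k+1)(k+2)) * sum_{i=0}^{k} (i+1) * x^(i), with k = len(xs) - 1."""
    k = len(xs) - 1
    weights = np.arange(1, k + 2)
    return 2.0 / ((k + 1) * (k + 2)) * (weights[:, None] * xs).sum(axis=0)

def weighted_average_running(xs):
    """Same average, computed with O(d) memory by a running update."""
    x_bar = np.zeros_like(xs[0], dtype=float)
    for k, x in enumerate(xs):
        x_bar = (k / (k + 2)) * x_bar + (2 / (k + 2)) * x
    return x_bar

rng = np.random.default_rng(0)
xs = rng.normal(size=(20, 3))      # stand-in for iterates x^(0), ..., x^(19)
direct = weighted_average_direct(xs)
running = weighted_average_running(xs)
print(np.max(np.abs(direct - running)))
```

Later iterates receive linearly larger weights, and the weights sum to one.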
There exist several accelerated variants of the DYS algorithm under stronger assumptions
See [2] for details
DYS has given an elegant solution to the nonsmooth convex optimization problem minx f(x)+g(x)+h(x)
But what if we have more than three components?
For smooth components, easy:
The gradients are additive, so if we have smooth components f1(x),…,fm(x), then just let f=f1+⋯+fm, and hence ∇f=∇f1+⋯+∇fm
Directly apply DYS as usual (of course, the smoothness parameter L may change, which affects the step size α)
Proximal operators are in general not additive, but we can use the "consensus trick".
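Suppose we want to minimize $F(x)=f(x)+\sum_{i=1}^{m}h_i(x)$, where f(x) is smooth and each $h_i(x)$ may be nonsmooth. Then we find that
$$x^*\in\arg\min_x\ f(x)+\sum_{i=1}^{m}h_i(x)\iff (x^*,\ldots,x^*)\in\arg\min_{\substack{x_{(1)},\ldots,x_{(m)}\\ x_{(1)}=\cdots=x_{(m)}}}\ f(\bar{x})+\sum_{i=1}^{m}h_i(x_{(i)}),$$ where $\bar{x}=m^{-1}\sum_{i=1}^{m}x_{(i)}$.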
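Therefore, if we want $x^*\in\mathbb{R}^d$, then we can work on a "stacked" variable $\mathbf{x}=(x_{(1)},\ldots,x_{(m)})\in\mathbb{R}^{md}$, and optimize the function $\tilde{F}(\mathbf{x}):=f(\bar{x})+I_C(\mathbf{x})+\tilde{h}(\mathbf{x})$, where $C=\{\mathbf{x}: x_{(1)}=\cdots=x_{(m)}\}$ is the consensus set, $I_C$ is its indicator function, and $\tilde{h}(\mathbf{x})=\sum_{i=1}^{m}h_i(x_{(i)})$.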
We have shown that an optimal point of ~F(x) is (x∗,…,x∗), where x∗ is an optimal point of the original problem.
More importantly, we can show that
proxαIC(x)=(¯x,…,¯x)
proxα~h(x)=(proxαh1(x(1)),…,proxαhm(x(m)))
This means that we only need to evaluate proxαhi(⋅) individually!
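Both proximal operators are easy to write down once the stacked variable is stored as an m×d array with one block per row. That layout and the three example functions h₁=‖x‖₁, h₂=indicator of {x ≥ 0}, h₃=½‖x‖² are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_consensus(X):
    """Projection onto C = {x_(1) = ... = x_(m)}: every block becomes the block average."""
    return np.tile(X.mean(axis=0), (X.shape[0], 1))

def prox_separable(X, prox_list, alpha):
    """prox of h~(x) = sum_i h_i(x_(i)): apply prox_{alpha h_i} to block i only."""
    return np.stack([prox(X[i], alpha) for i, prox in enumerate(prox_list)])

# Stacked variable with m = 3 blocks of dimension d = 2 (one block per row).
X = np.array([[1.0, -2.0],
              [3.0,  0.5],
              [-1.0, 4.5]])

prox_list = [
    lambda z, alpha: soft_threshold(z, alpha),   # h_1(x) = ||x||_1
    lambda z, alpha: np.maximum(z, 0.0),         # h_2(x) = indicator of x >= 0
    lambda z, alpha: z / (1 + alpha),            # h_3(x) = 0.5 * ||x||^2
]

P = prox_consensus(X)
Q = prox_separable(X, prox_list, alpha=1.0)
print(P)
print(Q)
```

Neither function ever needs a proximal operator of a sum of the hᵢ.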
This is essentially the key idea of proximal-proximal-gradient (PPG) algorithm
Consider the optimization problem
$$\min_{x\in\mathbb{R}^d}F(x):=\min_{x\in\mathbb{R}^d}\ r(x)+\frac{1}{n}\sum_{i=1}^{n}\big(f_i(x)+g_i(x)\big)$$
r(x), fi(x), and gi(x) are convex functions
fi(x) are differentiable
r(x) and gi(x) have simple proximal operators
Generalization of DYS
Given an initial value z(0)=(z(0)(1),…,z(0)(n)), iterate
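$$x^{(k+1/2)}=\mathrm{prox}_{\alpha r}(\bar{z}^{(k)})$$
$$x^{(k+1)}_{(i)}=\mathrm{prox}_{\alpha g_i}\big(2x^{(k+1/2)}-z^{(k)}_{(i)}-\alpha\nabla f_i(x^{(k+1/2)})\big),\quad i=1,\ldots,n$$
$$z^{(k+1)}_{(i)}=z^{(k)}_{(i)}+x^{(k+1)}_{(i)}-x^{(k+1/2)},\quad i=1,\ldots,n$$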
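A minimal sketch of the PPG iteration, storing z as an n×d array and taking z̄ to be the average of its rows. The function name `ppg` and the test problem are assumptions: r is the indicator of {x ≥ 0}, fᵢ(x)=½‖x−aᵢ‖², and gᵢ(x)=λᵢ‖x‖₁. For this separable problem the minimizer is max(ā−λ̄, 0), with ā and λ̄ the averages of the aᵢ and λᵢ, which gives a reference value.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ppg(prox_r, grad_fs, prox_gs, z0, alpha, n_iter=2000):
    """Proximal-proximal-gradient iteration; z has shape (n, d)."""
    z = np.asarray(z0, dtype=float).copy()
    n = z.shape[0]
    for _ in range(n_iter):
        x_half = prox_r(z.mean(axis=0), alpha)                       # x^(k+1/2)
        x = np.stack([
            prox_gs[i](2 * x_half - z[i] - alpha * grad_fs[i](x_half), alpha)
            for i in range(n)
        ])                                                           # x_(i)^(k+1)
        z = z + x - x_half                                           # z_(i)^(k+1)
    return prox_r(z.mean(axis=0), alpha)

# Illustrative data: n = 2 components in dimension d = 3.
A = np.array([[3.0, 0.5, -1.0],
              [1.0, -0.5, -2.0]])
lams = [0.5, 1.5]

prox_r = lambda z, alpha: np.maximum(z, 0.0)                         # r = indicator of x >= 0
grad_fs = [lambda x, a=a: x - a for a in A]                          # each f_i is 1-smooth
prox_gs = [lambda z, alpha, l=l: soft_threshold(z, alpha * l) for l in lams]

x_hat = ppg(prox_r, grad_fs, prox_gs, np.zeros((2, 3)), alpha=0.5)   # 0 < alpha < 3 / (2L)
x_closed = np.maximum(A.mean(axis=0) - np.mean(lams), 0.0)
print(x_hat, x_closed)
```

The loop over i is embarrassingly parallel, since each block only needs the shared point x(k+1/2).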
Remarks:
We first state the convergence of the z(k) variables.
Suppose that $f_1(x),\ldots,f_n(x)$ are differentiable and L-smooth. Select a step size $0<\alpha<3/(2L)$, and denote $p(z^{(k)})=\alpha^{-1}(z^{(k+1)}-z^{(k)})$. Then $\|p(z^{(k)})\|\to 0$ monotonically with the rate $\|p(z^{(k)})\|=O(1/\sqrt{k})$.
For the x-variables, we have x(k+1/2)→x∗ and x(k)(i)→x∗ for all i=1,…,n, where x∗ is an optimal point of F(x).
If, in addition, $\bar{g}(x)=n^{-1}\sum_{i=1}^{n}g_i(x)$ is Lipschitz continuous, then
F(x(k+1/2))−F(x∗)=O(1/√k).
Finally, let $x^{(k+1/2)}_{\mathrm{avg}}=k^{-1}\sum_{j=1}^{k}x^{(j+1/2)}$; then under the same assumptions we have
F(x(k+1/2)avg)−F(x∗)=O(1/k).
Other faster convergence rates with stronger assumptions can be found in [3].
We have summarized three important algorithms for nonsmooth optimization problems
DRS → DYS → PPG with increasing generality
These methods are very useful for statistical models with multiple constraints and/or regularization terms
They typically have much better convergence speed than subgradient methods
[1] Damek Davis and Wotao Yin (2016). Convergence rate analysis of several splitting schemes. Splitting methods in communication, imaging, science, and engineering.
[2] Damek Davis and Wotao Yin (2017). A three-operator splitting scheme and its optimization applications. Set-valued and variational analysis.
[3] Ernest K. Ryu and Wotao Yin (2019). Proximal-proximal-gradient method. Journal of Computational Mathematics.
[4] Patrick L. Combettes (2004). Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization.