Subgradient and subdifferential
Subgradient methods
Proximal gradient descent
The simplest gradient descent method deals with unconstrained and smooth problems
Projected gradient descent extends it to constrained problems
Today we discuss how to deal with nonsmooth problems
A subgradient of a convex function f: R^d → R at a point x is any vector g ∈ R^d such that
f(y) ≥ f(x) + g′(y − x), ∀y ∈ X, where X is the domain of f.
Recall that if f is differentiable, then f is convex if and only if
f(y)≥f(x)+[∇f(x)]′(y−x),∀x,y∈X.
The set of all subgradients of f at x is called the subdifferential:
∂f(x) = {g ∈ R^d : g is a subgradient of f at x}.
A subgradient always exists for a convex function: the subdifferential is nonempty at every interior point of the domain
∂f(x) is closed and convex
f is differentiable at x if and only if the subdifferential is a singleton containing the gradient: ∂f(x)={∇f(x)}
Linear combination: ∂(α_1 f_1 + α_2 f_2)(x) = α_1 ∂f_1(x) + α_2 ∂f_2(x), where α_1, α_2 ≥ 0, and for two sets C_1, C_2, αC_1 + βC_2 = {αx_1 + βx_2 : x_1 ∈ C_1, x_2 ∈ C_2}
Affine transformation: If g(x)=f(Ax+b), then ∂g(x)=A′∂f(Ax+b)
Chain rule: let f be convex, and let g be convex, differentiable, and nondecreasing. Then h(x) = g(f(x)) satisfies ∂h(x) = g′(f(x)) ∂f(x)
For f(x) = ∥x∥ (the Euclidean norm):
If x ≠ 0, ∂f(x) = {x/∥x∥}
If x = 0, ∂f(x) = {z : ∥z∥ ≤ 1}
For f(x) = ∥x∥_1:
∂f(x) = {(g_1, …, g_d)′ : g_i ∈ G_i}, where
If x_i ≠ 0, G_i = {sign(x_i)}
If x_i = 0, G_i = [−1, 1] = {z : |z| ≤ 1}
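As a quick sanity check, a subgradient of ∥x∥_1 can be assembled coordinate-wise. A minimal numpy sketch (choosing g_i = 0 at coordinates where x_i = 0, which is one of infinitely many valid picks):

```python
import numpy as np

def l1_subgradient(x):
    """Return one valid subgradient of f(x) = ||x||_1 at x.

    Where x_i != 0 the coordinate must be sign(x_i); where x_i == 0 any
    value in [-1, 1] works, and np.sign conveniently returns 0 there.
    """
    return np.sign(x)

x = np.array([1.5, 0.0, -2.0])
g = l1_subgradient(x)                     # -> [1., 0., -1.]
# verify the subgradient inequality f(y) >= f(x) + g'(y - x) at a few points
for y in [np.zeros(3), np.ones(3), np.array([-1.0, 2.0, 0.5])]:
    assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x)
```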
For f(X) = λ_max(X) defined on all symmetric d×d matrices, let λ_1 ≥ ⋯ ≥ λ_d be the eigenvalues and γ_1, …, γ_d be the associated eigenvectors. Then
∂f(X) = {T ∈ K : (λ_1 − λ_i) γ_i′ T γ_i = 0, i = 2, …, d}, where K = {T ∈ R^{d×d} : T′ = T, T ⪰ 0, tr(T) = 1}.
Special case: if λ_1 > λ_2 (the largest eigenvalue is simple), the constraints force T = γ_1 γ_1′, so f is differentiable at X with ∇f(X) = γ_1 γ_1′.
There is a very general rule: x* is an optimal point of f(x) if and only if 0 is a subgradient of f at x*:
f(x*) = min_x f(x) ⇔ 0 ∈ ∂f(x*).
This is true even for nondifferentiable f.
If f is differentiable, then the condition reduces to ∇f(x∗)=0.
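As a one-dimensional illustration (the function f(x) = |x| + (1/2)(x − b)² is an illustrative choice), the rule 0 ∈ ∂f(x*) pins down the minimizer in closed form:

```python
import numpy as np

def minimize_abs_plus_quadratic(b):
    """Minimize f(x) = |x| + 0.5*(x - b)**2 via the rule 0 in subdiff f(x*).

    For x* != 0 optimality requires sign(x*) + x* - b = 0; for x* = 0 it
    requires |b| <= 1.  Both cases combine into soft-thresholding b at 1.
    """
    return np.sign(b) * max(abs(b) - 1.0, 0.0)

b = 2.0
x_star = minimize_abs_plus_quadratic(b)   # x* = 1.0
# check 0 in subdiff f(x*): here x* > 0, so the subdifferential is
# {sign(x*) + x* - b} = {1 + 1 - 2} = {0}
assert np.isclose(np.sign(x_star) + x_star - b, 0.0)
# brute-force confirmation on a fine grid
grid = np.linspace(-3.0, 3.0, 10001)
vals = np.abs(grid) + 0.5 * (grid - b) ** 2
assert abs(grid[np.argmin(vals)] - x_star) < 1e-3
```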
Recall that gradient descent solves min_x f(x), but it requires f to be differentiable.
Subgradient descent loosens this condition by replacing the gradient with a subgradient: given an initial value x^(0), iterate
x^(k+1) = x^(k) − α_k g^(k), k = 0, 1, …,
where α_k is the step size, and g^(k) ∈ ∂f(x^(k)) is any subgradient of f at x^(k).
If the optimization problem is constrained to a closed and nonempty convex set C, then, as in projected gradient descent, we can use the projected subgradient descent method:
x^(k+1) = P_C(x^(k) − α_k g^(k)), k = 0, 1, ….
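A minimal numpy sketch of projected subgradient descent on the illustrative problem min ∥x − c∥_1 subject to ∥x∥ ≤ 1 with c = (2, 0)′, whose solution is x* = (1, 0)′ with optimal value 1, using the diminishing step size α_k = 1/√(k+1):

```python
import numpy as np

def project_unit_ball(x):
    """Euclidean projection onto {x : ||x|| <= 1}."""
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

def projected_subgradient_descent(c, x0, n_iter=1000):
    """Minimize f(x) = ||x - c||_1 over the unit ball.

    Tracks the best iterate, since subgradient steps are not monotone.
    """
    x = x0.copy()
    best_x, best_f = x.copy(), np.sum(np.abs(x - c))
    for k in range(n_iter):
        g = np.sign(x - c)                       # a subgradient of f at x
        x = project_unit_ball(x - g / np.sqrt(k + 1.0))
        f = np.sum(np.abs(x - c))
        if f < best_f:
            best_f, best_x = f, x.copy()
    return best_x, best_f

c = np.array([2.0, 0.0])
x_best, f_best = projected_subgradient_descent(c, x0=np.zeros(2))
assert abs(f_best - 1.0) < 1e-2                  # optimal value is 1
```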
Does (projected) subgradient descent converge?
How fast does it converge?
How to pick αk?
Subgradient descent can be viewed as a special case of projected subgradient descent
Their convergence properties are similar
We show the results of projected subgradient descent for generality
The projected subgradient descent iterations satisfy
∥x^(k+1) − x*∥² ≤ ∥x^(k) − x*∥² − 2α_k (f(x^(k)) − f(x*)) + α_k² ∥g^(k)∥².
Proof: ∥x^(k+1) − x*∥² = ∥P_C(x^(k) − α_k g^(k)) − x*∥² ≤ ∥x^(k) − α_k g^(k) − x*∥² (projection is nonexpansive and x* = P_C(x*)) = ∥x^(k) − x*∥² − 2α_k (x^(k) − x*)′g^(k) + α_k² ∥g^(k)∥²
The subgradient inequality gives f(x*) − f(x^(k)) ≥ (x* − x^(k))′g^(k), i.e., −2α_k (x^(k) − x*)′g^(k) ≤ −2α_k (f(x^(k)) − f(x*)), so the lemma holds.
Suppose that f: R^d → R is convex and Lipschitz continuous on C with constant L. Then
f(x_best^(k)) − f(x*) ≤ (∥x^(0) − x*∥² + L² Σ_{i=0}^k α_i²) / (2 Σ_{i=0}^k α_i).
For subgradient methods we can no longer guarantee the objective function values to be nonincreasing, so we select the best iterate x_best^(k), defined by
f(x_best^(k)) = min_{i=0,…,k} f(x^(i)).
With a fixed step size α_k ≡ α, f(x_best^(k)) − f(x*) ≤ ∥x^(0) − x*∥²/(2αk) + L²α/2.
With a diminishing step size α_k that satisfies Σ_k α_k² < ∞ and Σ_k α_k = ∞, we have f(x_best^(k)) → f(x*).
The "best" choice is α_k = O(1/√k), which gives f(x_best^(k)) − f(x*) ∼ (∥x^(0) − x*∥² + L² log(k))/√k.
Applying the fundamental lemma recursively, we have
∥x^(k+1) − x*∥² ≤ ∥x^(0) − x*∥² − 2 Σ_{i=0}^k α_i (f(x^(i)) − f(x*)) + Σ_{i=0}^k α_i² ∥g^(i)∥².
Rearranging the terms gives
2 Σ_{i=0}^k α_i (f(x^(i)) − f(x*)) ≤ ∥x^(0) − x*∥² − ∥x^(k+1) − x*∥² + Σ_{i=0}^k α_i² ∥g^(i)∥² ≤ ∥x^(0) − x*∥² + L² Σ_{i=0}^k α_i²,
where the last step uses ∥g^(i)∥ ≤ L, a consequence of Lipschitz continuity.
If in addition f is strongly convex with parameter m > 0, then with a step size α_k = 2/(m(k+1)), we have f(x_best^(k)) − f(x*) ≤ (2L²/m) · 1/(k+1).
Overall, subgradient methods have slower convergence rates than their gradient descent counterparts.
If f is Lipschitz continuous, then the optimization error decays at the rate of O(1/√k) (gradient descent: O(1/k)).
If f is Lipschitz continuous and strongly convex, then the optimization error decays at the rate of O(1/k) (gradient descent: O(ρ^k) for some ρ < 1).
Subgradient methods are simple and general enough to solve nonsmooth convex optimization problems
But their convergence is slow
In many statistical models, the nonsmooth function we want to optimize has a special structure: F(x)=f(x)+h(x), where f is convex and smooth, and h is convex but possibly nonsmooth
For the Lasso problem, F(β) = f(β) + h(β), where f(β) = (1/2)∥y − Xβ∥² and h(β) = λ∥β∥_1
Constrained problems min_{x∈C} f(x) can also be written as F(x) = f(x) + h(x), where h(x) = I_C(x) = 0 if x ∈ C, and ∞ if x ∉ C
The proximal gradient descent algorithm solves min_x F(x) using the following iteration scheme: given an initial value x^(0), iterate
x^(k+1) = prox_{αh}(x^(k) − α ∇f(x^(k))), k = 0, 1, …,
where prox_{αh}(·) is the proximal operator (defined next) of h with step size α.
The proximal operator of a convex function h with step size α is defined as
prox_{αh}(x) = argmin_u h(u) + (1/2α)∥u − x∥².
This is itself a strongly convex optimization problem, so the minimizer is unique
And it has a closed form for many functions h
If h(x) = ∥x∥, then prox_{αh}(x) = (1 − α/∥x∥)x if ∥x∥ ≥ α, and 0 if ∥x∥ < α.
If h(x) = (1/2)∥x∥², then prox_{αh}(x) = x/(1 + α).
If h(x) = (1/2)x′Ax + b′x + c, where A is positive definite, then
prox_{αh}(x) = (I + αA)^{−1}(x − αb).
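Since the quadratic prox objective is smooth, the closed form can be checked against its stationarity condition Au + b + (1/α)(u − x) = 0. A small numpy check (the values of A, b, α, and x are arbitrary illustrative choices):

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])                 # positive definite
b = np.array([1.0, -1.0])
alpha = 0.5
x = np.array([0.3, -0.7])

# closed form: prox_{alpha h}(x) = (I + alpha A)^{-1} (x - alpha b)
prox = np.linalg.solve(np.eye(2) + alpha * A, x - alpha * b)

# stationarity of u -> h(u) + (1/(2 alpha)) ||u - x||^2 at the prox point
grad = A @ prox + b + (prox - x) / alpha
assert np.allclose(grad, 0.0)
```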
If h(x) = ∥x∥_1, then prox_{αh}(x) = S_α(x), the soft-thresholding operator, defined elementwise by
(S_α(x))_i = x_i − α if x_i > α; 0 if |x_i| ≤ α; x_i + α if x_i < −α.
If h(x) = I_C(x), then prox_{αh}(x) = P_C(x), the projection operator.
This means that proximal gradient descent reduces to projected gradient descent when h(x) = I_C(x).
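Putting these pieces together for the Lasso problem above gives the classic iterative soft-thresholding scheme: a gradient step on the least-squares part followed by soft-thresholding. A minimal numpy sketch (the data, λ, and iteration count are illustrative; correctness is checked via the subgradient optimality condition ∥X′(Xβ − y)∥_∞ ≤ λ):

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding S_t, the prox of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_prox_grad(X, y, lam, n_iter=500):
    """Proximal gradient descent for 0.5*||y - X b||^2 + lam*||b||_1.

    Uses the fixed step size alpha = 1/L with L = lambda_max(X'X),
    the smoothness constant of the least-squares part.
    """
    L = np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)          # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true
lam = 0.1
beta = lasso_prox_grad(X, y, lam)
# optimality: 0 in the subdifferential, so |X'(X beta - y)| <= lam everywhere
assert np.all(np.abs(X.T @ (X @ beta - y)) <= lam + 1e-6)
```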
If h(x) = a·g(x) + b, a > 0, then prox_{αh}(x) = prox_{aαg}(x)
If h(x) = g(ax + b), a ≠ 0, then prox_{αh}(x) = (prox_{a²αg}(ax + b) − b)/a
If h(x) = g(x) + a′x + b, then prox_{αh}(x) = prox_{αg}(x − αa)
If h(x) = g(x) + (ρ/2)∥x − a∥², then prox_{αh}(x) = prox_{α̃g}((α̃/α)x + (ρα̃)a), where α̃ = α/(1 + αρ)
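These calculus rules are easy to validate numerically. For example, the linear-perturbation rule with g = ∥·∥_1 can be checked coordinate-by-coordinate by brute force, since the objective separates across coordinates (the values of α, a, and x below are arbitrary):

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

alpha = 0.7
a = np.array([0.5, -1.2])
x = np.array([1.0, 0.4])

# rule: prox_{alpha h}(x) = prox_{alpha g}(x - alpha a) for h = g + a'x + b
rule = soft_threshold(x - alpha * a, alpha)

# brute force: minimize |u| + a_i u + (u - x_i)^2 / (2 alpha) per coordinate
# (the constant b does not affect the argmin)
grid = np.linspace(-5.0, 5.0, 200001)
brute = np.array([
    grid[np.argmin(np.abs(grid) + ai * grid + (grid - xi) ** 2 / (2 * alpha))]
    for ai, xi in zip(a, x)
])
assert np.allclose(rule, brute, atol=1e-4)
```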
We first present a useful property of the proximal gradient descent algorithm. The proof can be found in [7].
Suppose f is convex and L-smooth. With the step size α = 1/L,
F(x^(k+1)) ≤ F(x^(k)), ∥x^(k+1) − x*∥ ≤ ∥x^(k) − x*∥.
The main convergence theorem: suppose f is convex and L-smooth. With the step size α = 1/L,
F(x^(k)) − F(x*) ≤ L∥x^(0) − x*∥² / (2k).
If in addition f is strongly convex with parameter m > 0, then with a fixed step size α = 1/L, we have
∥x^(k) − x*∥² ≤ (1 − m/L)^k ∥x^(0) − x*∥².
Proximal gradient descent matches the convergence rate of gradient descent.
Useful when prox_{αh} can be efficiently evaluated.
If f is L-smooth, then the optimization error, F(x^(k)) − F(x*), decays at the rate of O(1/k).
If f is L-smooth and m-strongly convex, then the optimization error decays exponentially fast at the rate of O(ρ^k), where ρ = 1 − m/L.
It turns out that the proximal gradient descent algorithm can be accelerated almost for free, using a somewhat magical technique.
The accelerated version: given an initial value x^(0) = x^(−1), iterate
y^(k+1) = x^(k) + ((k − 1)/(k + 2))(x^(k) − x^(k−1)),
x^(k+1) = prox_{αh}(y^(k+1) − α ∇f(y^(k+1))), k = 0, 1, …
The accelerated convergence rate: suppose f is convex and L-smooth. With the step size α = 1/L,
F(x^(k)) − F(x*) ≤ 2L∥x^(0) − x*∥² / (k + 1)².
If in addition f is strongly convex with parameter m > 0, then with κ = L/m, the iteration becomes
y^(k+1) = x^(k) + ((√κ − 1)/(√κ + 1))(x^(k) − x^(k−1)),
x^(k+1) = prox_{αh}(y^(k+1) − α ∇f(y^(k+1))), k = 0, 1, …
With a fixed step size α = 1/L, we have
F(x^(k)) − F(x*) ≤ (1 − 1/√κ)^k (F(x^(0)) − F(x*) + m∥x^(0) − x*∥²/2).
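A sketch of this accelerated iteration for the Lasso, assuming X has full column rank so that f(β) = (1/2)∥y − Xβ∥² is strongly convex with m = λ_min(X′X) (the data, λ, and iteration count are illustrative):

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding, the prox of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_accel_prox_grad(X, y, lam, n_iter=300):
    """Accelerated proximal gradient for 0.5*||y - X b||^2 + lam*||b||_1.

    Uses the strongly convex momentum (sqrt(kappa)-1)/(sqrt(kappa)+1)
    with kappa = L/m taken from the eigenvalues of X'X, and alpha = 1/L.
    """
    evals = np.linalg.eigvalsh(X.T @ X)
    m, L = evals.min(), evals.max()
    momentum = (np.sqrt(L / m) - 1.0) / (np.sqrt(L / m) + 1.0)
    beta = beta_prev = np.zeros(X.shape[1])
    for _ in range(n_iter):
        yk = beta + momentum * (beta - beta_prev)   # extrapolation step
        beta_prev = beta
        beta = soft_threshold(yk - X.T @ (X @ yk - y) / L, lam / L)
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 8))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0])
lam = 0.2
beta = lasso_accel_prox_grad(X, y, lam)
# subgradient optimality at the solution: |X'(X beta - y)| <= lam
assert np.all(np.abs(X.T @ (X @ beta - y)) <= lam + 1e-6)
```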
(Projected) subgradient descent is a general method to optimize a nonsmooth convex function
However, its convergence is typically slow
If the nonsmooth part has a simple proximal operator, then the proximal gradient descent algorithm is much preferred
Acceleration also comes almost "for free"
[1] Stephen Boyd and Lieven Vandenberghe (2004). Convex optimization. Cambridge University Press.
[2] Robert M. Gower (2018). Convergence theorems for gradient descent. Lecture notes for Statistical Optimization.
[3] Claude Vallee, Danielle Fortune, and Camelia Lerintiu (2008). Subdifferential of the Largest Eigenvalue of a Symmetrical Matrix: Application of Direct Projection Methods. Analysis and Applications.
[4] Adrian S. Lewis (1999). Nonsmooth analysis of eigenvalues. Mathematical Programming.
[5] https://yuxinchen2020.github.io/ele522_optimization/lectures/subgradient_methods.pdf
[6] Neal Parikh and Stephen Boyd (2014). Proximal algorithms. Foundations and trends® in Optimization.