
Computational Statistics

Lecture 7

Yixuan Qiu

2023-10-25

1 / 41

Optimization

2 / 41

Today's Topics

  • Subgradient and subdifferential

  • Subgradient methods

  • Proximal gradient descent

3 / 41

Last Time

  • The simplest gradient descent method deals with unconstrained and smooth problems

  • Projected gradient descent extends it to constrained problems

  • Today we discuss how to deal with nonsmooth problems

4 / 41


Subgradient

A subgradient of a convex function $f:\mathbb{R}^d\to\mathbb{R}$ at a point $x$ is any vector $g\in\mathbb{R}^d$ such that

$$f(y)\ge f(x)+g^\top(y-x),\quad\forall y\in\mathcal{X},$$

where $\mathcal{X}$ is the domain of $f$.

Recall that if $f$ is differentiable, then $f$ is convex if and only if

$$f(y)\ge f(x)+[\nabla f(x)]^\top(y-x),\quad\forall x,y\in\mathcal{X}.$$

5 / 41

Subdifferential

The set of all subgradients of $f$ at $x$ is called the subdifferential:

$$\partial f(x)=\{g\in\mathbb{R}^d: g\text{ is a subgradient of }f\text{ at }x\}.$$

6 / 41

Properties

  • A subgradient always exists for a convex function (the subdifferential is nonempty)

  • $\partial f(x)$ is a closed and convex set

  • $f$ is differentiable at $x$ if and only if the subdifferential is a singleton containing the gradient: $\partial f(x)=\{\nabla f(x)\}$

7 / 41

Properties

  • Linear combination: $\partial(\alpha_1 f_1+\alpha_2 f_2)=\alpha_1\,\partial f_1+\alpha_2\,\partial f_2$, where $\alpha_1,\alpha_2\ge 0$, and for two sets $C_1,C_2$, $\alpha C_1+\beta C_2=\{\alpha x_1+\beta x_2: x_1\in C_1,\ x_2\in C_2\}$

  • Affine transformation: If $g(x)=f(Ax+b)$, then $\partial g(x)=A^\top\,\partial f(Ax+b)$

  • Chain rule: Let $f$ be convex, and $g$ be convex, differentiable, and nondecreasing. Denote $h(x)=g(f(x))$; then $\partial h(x)=g'(f(x))\,\partial f(x)$

8 / 41

Examples

For $f(x)=\|x\|_2$:

  • If $x\ne 0$, $\partial f(x)=\{x/\|x\|_2\}$

  • If $x=0$, $\partial f(x)=\{z:\|z\|_2\le 1\}$

For $f(x)=\|x\|_1$:

  • $\partial f(x)=\{(g_1,\ldots,g_d): g_i\in G_i\}$, where

  • If $x_i\ne 0$, $G_i=\{\operatorname{sign}(x_i)\}$

  • If $x_i=0$, $G_i=[-1,1]=\{z:|z|\le 1\}$

9 / 41
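A small numerical check of the $\ell_1$ example (an illustrative sketch, not part of the slides): pick the subgradient with $g_i=\operatorname{sign}(x_i)$, taking $0\in[-1,1]$ for zero coordinates, and verify the defining inequality at random points.

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_subgradient(x):
    # One valid subgradient of f(x) = ||x||_1:
    # sign(x_i) for x_i != 0, and 0 (any value in [-1, 1] works) for x_i = 0
    return np.sign(x)

x = np.array([1.5, 0.0, -2.0])
g = l1_subgradient(x)

# Verify the subgradient inequality f(y) >= f(x) + g^T (y - x)
for _ in range(1000):
    y = rng.normal(size=3)
    assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x) - 1e-12
print("subgradient inequality holds at all sampled points")
```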

Examples [3,4]

For $f(X)=\lambda_{\max}(X)$ defined on all symmetric $d\times d$ matrices, let $\lambda_1\ge\cdots\ge\lambda_d$ be the eigenvalues and $\gamma_1,\ldots,\gamma_d$ be the associated eigenvectors. Then

$$\partial f(X)=\{T\in K:(\lambda_1-\lambda_i)\gamma_i^\top T\gamma_i=0,\ i=2,\ldots,d\},$$

where $K=\{T\in\mathbb{R}^{d\times d}: T=T^\top,\ T\succeq 0,\ \operatorname{tr}(T)=1\}$.

Special cases:

  • If $\lambda_1>\lambda_2$, then $\partial f(X)=\{\nabla f(X)\}=\{\gamma_1\gamma_1^\top\}$.
  • If $\lambda_1=\cdots=\lambda_r>\lambda_{r+1}$, then $\partial f(X)=\left\{\sum_{i=1}^r w_i\gamma_i\gamma_i^\top: w\in\mathbb{R}^r,\ w\ge 0,\ \sum_{i=1}^r w_i=1\right\}$.
10 / 41
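The $\lambda_{\max}$ subdifferential can also be checked numerically (a sketch using NumPy, not part of the slides): when $\lambda_1>\lambda_2$, the matrix $T=\gamma_1\gamma_1^\top$ should satisfy $\lambda_{\max}(Y)\ge\lambda_{\max}(X)+\operatorname{tr}(T(Y-X))$ for every symmetric $Y$.

```python
import numpy as np

rng = np.random.default_rng(42)

def sym(d):
    # Random symmetric test matrix
    A = rng.normal(size=(d, d))
    return (A + A.T) / 2

d = 5
X = sym(d)
lam, gamma = np.linalg.eigh(X)   # eigenvalues in ascending order
g1 = gamma[:, -1]                # eigenvector of the largest eigenvalue
T = np.outer(g1, g1)             # candidate subgradient gamma_1 gamma_1^T

# Check f(Y) >= f(X) + <T, Y - X> with the trace inner product
for _ in range(200):
    Y = sym(d)
    lhs = np.linalg.eigvalsh(Y)[-1]
    rhs = lam[-1] + np.trace(T @ (Y - X))
    assert lhs >= rhs - 1e-10
print("eigenvalue subgradient inequality holds at all sampled matrices")
```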

Relation to Optimality Condition

There is a very general rule: $x^\star$ is an optimal point of $f(x)$ if and only if $0$ is a subgradient of $f$ at $x^\star$:

$$f(x^\star)=\min_x f(x)\iff 0\in\partial f(x^\star).$$

This is true even for nondifferentiable $f$.

If $f$ is differentiable, then the condition reduces to $\nabla f(x^\star)=0$.

11 / 41

Subgradient Methods

12 / 41


Subgradient Descent

Recall that gradient descent solves $\min_x f(x)$, but it requires $f$ to be differentiable.

Subgradient descent loosens this condition by replacing the gradient with a subgradient: given an initial value $x^{(0)}$, iterate

$$x^{(k+1)}=x^{(k)}-\alpha_k g^{(k)},\quad k=0,1,\ldots,$$

where $\alpha_k$ is the step size, and $g^{(k)}\in\partial f(x^{(k)})$ is any subgradient of $f$ at $x^{(k)}$.

13 / 41
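As an illustrative sketch (not code from the lecture), here is subgradient descent on $f(x)=\|x-c\|_1$, whose minimizer is $x^\star=c$, using the $\ell_1$ subgradient and a diminishing step size:

```python
import numpy as np

# Subgradient descent on the nonsmooth function f(x) = ||x - c||_1,
# whose minimizer is x* = c.  A subgradient at x is sign(x - c).
c = np.array([1.0, -2.0, 3.0, -0.5, 1.5])

def f(x):
    return np.sum(np.abs(x - c))

x = np.zeros_like(c)
f_best, x_best = f(x), x.copy()
for k in range(1, 5001):
    g = np.sign(x - c)             # a subgradient in the subdifferential of f at x
    x = x - 0.5 / np.sqrt(k) * g   # diminishing step size alpha_k = O(1/sqrt(k))
    if f(x) < f_best:              # f(x^(k)) is not monotone: track the best iterate
        f_best, x_best = f(x), x.copy()

print(f_best)  # close to the optimal value 0
```

Note the objective is not monotonically decreasing along the iterates, which is exactly why the best iterate must be tracked.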

Projected Subgradient Descent

If the optimization problem is constrained on a closed and nonempty convex set $C$, then similar to projected gradient descent, we can use the projected subgradient descent method:

$$x^{(k+1)}=P_C\big(x^{(k)}-\alpha_k g^{(k)}\big),\quad k=0,1,\ldots.$$

14 / 41
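A sketch of the projected variant on the same toy objective, constrained to the box $C=[-1,1]^d$ (illustrative, not from the slides): the projection $P_C$ is a coordinate-wise clip, and for the $c$ below the minimizer is $x^\star=\operatorname{clip}(c,-1,1)$ with $f(x^\star)=3.5$.

```python
import numpy as np

# Projected subgradient descent: minimize f(x) = ||x - c||_1 over the
# box C = [-1, 1]^d.  The projection P_C is a coordinate-wise clip.
c = np.array([1.0, -2.0, 3.0, -0.5, 1.5])

def f(x):
    return np.sum(np.abs(x - c))

x = np.zeros_like(c)
f_best = f(x)
for k in range(1, 5001):
    g = np.sign(x - c)                                    # subgradient of f at x
    x = np.clip(x - 0.5 / np.sqrt(k) * g, -1.0, 1.0)      # step, then project onto C
    f_best = min(f_best, f(x))

print(f_best)  # close to the constrained optimum f(x*) = 3.5
```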

Questions

  • Does (projected) subgradient descent converge?

  • How fast does it converge?

  • How do we pick the step size $\alpha_k$?

15 / 41

Convergence Analysis

  • Subgradient descent can be viewed as a special case of projected subgradient descent

  • Their convergence properties are similar

  • We show the results of projected subgradient descent for generality

16 / 41


A Fundamental Lemma

The projected subgradient descent iterations satisfy

$$\|x^{(k+1)}-x^\star\|^2\le\|x^{(k)}-x^\star\|^2-2\alpha_k\big(f(x^{(k)})-f(x^\star)\big)+\alpha_k^2\|g^{(k)}\|^2.$$

Proof:

$$\begin{aligned}\|x^{(k+1)}-x^\star\|^2&=\|P_C(x^{(k)}-\alpha_k g^{(k)})-x^\star\|^2\\&\le\|x^{(k)}-\alpha_k g^{(k)}-x^\star\|^2\quad\text{(projection is nonexpansive)}\\&=\|x^{(k)}-x^\star\|^2-2\alpha_k(x^{(k)}-x^\star)^\top g^{(k)}+\alpha_k^2\|g^{(k)}\|^2.\end{aligned}$$

The subgradient inequality gives $f(x^\star)\ge f(x^{(k)})+(x^\star-x^{(k)})^\top g^{(k)}$, i.e., $-2\alpha_k(x^{(k)}-x^\star)^\top g^{(k)}\le-2\alpha_k\big(f(x^{(k)})-f(x^\star)\big)$, so the lemma holds.

17 / 41

Convergence Property

Suppose that $f:\mathbb{R}^d\to\mathbb{R}$ is convex and Lipschitz continuous on $C$ with constant $L$. Then

$$f(x_{\text{best}}^{(k)})-f(x^\star)\le\frac{\|x^{(0)}-x^\star\|^2+L^2\sum_{i=0}^k\alpha_i^2}{2\sum_{i=0}^k\alpha_i}.$$

For subgradient methods we can no longer guarantee that the objective function values are nonincreasing, so we select the best iterate $x_{\text{best}}^{(k)}$, defined by

$$f(x_{\text{best}}^{(k)})=\min_{i=0,\ldots,k}f(x^{(i)}).$$

18 / 41

Implications

  • With a fixed step size $\alpha_k\equiv\alpha$, $f(x_{\text{best}}^{(k)})-f(x^\star)\le\dfrac{\|x^{(0)}-x^\star\|^2}{2\alpha k}+\dfrac{L^2\alpha}{2}$.

  • With a diminishing step size $\alpha_k$ that satisfies $\sum_k\alpha_k^2<\infty$ and $\sum_k\alpha_k=\infty$, we have $f(x_{\text{best}}^{(k)})\to f(x^\star)$.

  • The "best" choice is $\alpha_k=O(1/\sqrt{k})$, which gives $f(x_{\text{best}}^{(k)})-f(x^\star)\lesssim\dfrac{\|x^{(0)}-x^\star\|^2+L^2\log(k)}{\sqrt{k}}$.

19 / 41

Proof

Applying the fundamental lemma recursively, we have

$$\|x^{(k+1)}-x^\star\|^2\le\|x^{(0)}-x^\star\|^2-2\sum_{i=0}^k\alpha_i\big(f(x^{(i)})-f(x^\star)\big)+\sum_{i=0}^k\alpha_i^2\|g^{(i)}\|^2.$$

Rearranging the terms gives

$$2\sum_{i=0}^k\alpha_i\big(f(x^{(i)})-f(x^\star)\big)\le\|x^{(0)}-x^\star\|^2-\|x^{(k+1)}-x^\star\|^2+\sum_{i=0}^k\alpha_i^2\|g^{(i)}\|^2\le\|x^{(0)}-x^\star\|^2+L^2\sum_{i=0}^k\alpha_i^2,$$

where the last step uses $\|g^{(i)}\|\le L$, a consequence of the Lipschitz continuity of $f$.

20 / 41

Convergence with Strong Convexity [5]

If, in addition, $f$ is strongly convex with parameter $m>0$, then with the step size $\alpha_k=\dfrac{2}{m(k+1)}$, we have

$$f(x_{\text{best}}^{(k)})-f(x^\star)\le\frac{2L^2}{m}\cdot\frac{1}{k+1}.$$

21 / 41

Summary

  • Overall, subgradient methods have slower convergence rates than their gradient descent counterparts.

  • If $f$ is Lipschitz continuous, then the optimization error decays at the rate $O(1/\sqrt{k})$ (gradient descent achieves $O(1/k)$).

  • If $f$ is Lipschitz continuous and strongly convex, then the optimization error decays at the rate $O(1/k)$ (gradient descent achieves $O(\rho^k)$ for some $\rho<1$).

22 / 41

Proximal Gradient Descent

23 / 41

Motivation

  • Subgradient methods are very simple and general tools for solving nonsmooth convex optimization problems

  • However, they have undesirable convergence speed

  • In many statistical models, the nonsmooth function we want to optimize has a special structure: $F(x)=f(x)+h(x)$, where $f$ is convex and smooth, and $h$ is convex but possibly nonsmooth

24 / 41

Examples

  • For the Lasso problem, $F(\beta)=f(\beta)+h(\beta)$, where $f(\beta)=\frac{1}{2}\|y-X\beta\|^2$ and $h(\beta)=\lambda\|\beta\|_1$

  • Constrained problems $\min_{x\in C}f(x)$ can also be written as $F(x)=f(x)+h(x)$, where $h(x)=I_C(x)=\begin{cases}0,&x\in C\\ \infty,&x\notin C\end{cases}$

25 / 41

Proximal Gradient Descent

The proximal gradient descent algorithm solves $\min_x F(x)$ using the following iteration scheme: given an initial value $x^{(0)}$, iterate

$$x^{(k+1)}=\operatorname{prox}_{\alpha h}\big(x^{(k)}-\alpha\nabla f(x^{(k)})\big),\quad k=0,1,\ldots,$$

where $\operatorname{prox}_{\alpha h}(\cdot)$ is the proximal operator (defined later) of $h$ with step size $\alpha$.

26 / 41

Proximal Operator

The proximal operator of a convex function $h$ with step size $\alpha$ is defined as

$$\operatorname{prox}_{\alpha h}(x)=\arg\min_u\ h(u)+\frac{1}{2\alpha}\|u-x\|^2.$$

  • This is itself a strongly convex optimization problem

  • But it has a closed form for many $h$ functions

27 / 41
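The definition can be checked directly (an illustrative sketch using SciPy, not part of the slides): for $h(x)=\frac12\|x\|^2$, numerically minimizing $h(u)+\frac{1}{2\alpha}\|u-x\|^2$ should recover the closed form $x/(1+\alpha)$.

```python
import numpy as np
from scipy.optimize import minimize

# Check prox_{alpha h} for h(x) = (1/2)||x||^2, whose closed form
# is prox_{alpha h}(x) = x / (1 + alpha).
alpha = 0.7
x = np.array([2.0, -1.0, 0.5])

def prox_objective(u):
    # h(u) + ||u - x||^2 / (2 alpha): a strongly convex problem
    return 0.5 * np.sum(u**2) + np.sum((u - x)**2) / (2 * alpha)

numeric = minimize(prox_objective, np.zeros_like(x)).x
closed_form = x / (1 + alpha)
print(np.max(np.abs(numeric - closed_form)))  # tiny numerical error
```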

Examples

  • If $h(x)=\|x\|_2$, then

$$\operatorname{prox}_{\alpha h}(x)=\begin{cases}(1-\alpha/\|x\|_2)\,x,&\|x\|_2\ge\alpha\\ 0,&\|x\|_2<\alpha.\end{cases}$$

  • If $h(x)=(1/2)\|x\|^2$, then $\operatorname{prox}_{\alpha h}(x)=(1+\alpha)^{-1}x$.

  • If $h(x)=(1/2)x^\top Ax+b^\top x+c$, where $A$ is positive definite, then

$$\operatorname{prox}_{\alpha h}(x)=(I+\alpha A)^{-1}(x-\alpha b).$$

28 / 41


Examples

  • If $h(x)=\|x\|_1$, then $\operatorname{prox}_{\alpha h}(x)=S_\alpha(x)$, the soft-thresholding operator,

$$(S_\alpha(x))_i=\begin{cases}x_i-\alpha,&x_i>\alpha\\ 0,&|x_i|\le\alpha\\ x_i+\alpha,&x_i<-\alpha.\end{cases}$$

  • If $h(x)=I_C(x)$, then $\operatorname{prox}_{\alpha h}(x)=P_C(x)$, the projection operator.

  • This means that proximal gradient descent reduces to projected gradient descent when $h(x)=I_C(x)$.

29 / 41
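Putting the pieces together (an illustrative sketch, not code from the lecture): soft-thresholding is exactly the prox step of proximal gradient descent for the Lasso, the classic ISTA algorithm. The dimensions, $\lambda$, and synthetic data below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(x, t):
    # prox of t * ||.||_1: shrink each coordinate toward zero by t
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# Proximal gradient descent (ISTA) for the Lasso:
#   F(beta) = (1/2) ||y - X beta||^2 + lam * ||beta||_1
n, d = 100, 10
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.normal(size=n)
lam = 5.0

L = np.linalg.norm(X.T @ X, 2)   # f is L-smooth with L = lambda_max(X^T X)
alpha = 1.0 / L                  # fixed step size 1/L
beta = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ beta - y)                               # gradient of smooth part f
    beta = soft_threshold(beta - alpha * grad, alpha * lam)   # prox step on h

print(beta)  # sparse estimate, close to beta_true on the first 3 coordinates
```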

Properties [6]

  • If $h(x)=a\,g(x)+b$ with $a>0$, then $\operatorname{prox}_{\alpha h}(x)=\operatorname{prox}_{a\alpha g}(x)$

  • If $h(x)=g(ax+b)$ with $a\ne 0$, then $\operatorname{prox}_{\alpha h}(x)=\big(\operatorname{prox}_{a^2\alpha g}(ax+b)-b\big)/a$

  • If $h(x)=g(x)+a^\top x+b$, then $\operatorname{prox}_{\alpha h}(x)=\operatorname{prox}_{\alpha g}(x-a\alpha)$

  • If $h(x)=g(x)+(\rho/2)\|x-a\|^2$, then $\operatorname{prox}_{\alpha h}(x)=\operatorname{prox}_{\tilde\alpha g}\big((\tilde\alpha/\alpha)x+(\rho\tilde\alpha)a\big)$, where $\tilde\alpha=\alpha/(1+\alpha\rho)$

30 / 41
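One of these calculus rules can be verified numerically (an illustrative sketch, not from the slides): for $h(x)=g(x)+a^\top x+b$ with $g=\|\cdot\|_1$, the rule says $\operatorname{prox}_{\alpha h}(x)=S_\alpha(x-a\alpha)$, and this point should minimize the prox objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rule: if h(x) = g(x) + a^T x + b, then prox_{alpha h}(x) = prox_{alpha g}(x - a alpha).
# Take g = ||.||_1, whose prox is soft-thresholding.
alpha = 0.5
a = np.array([0.3, -0.8, 0.1])
b = 2.0
x = np.array([1.0, -0.2, 0.6])

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rule = soft_threshold(x - a * alpha, alpha)   # right-hand side of the rule

def prox_objective(u):
    # h(u) + ||u - x||^2 / (2 alpha) with h(u) = ||u||_1 + a^T u + b
    return np.sum(np.abs(u)) + a @ u + b + np.sum((u - x) ** 2) / (2 * alpha)

# The rule's output should minimize the strongly convex prox objective,
# so no perturbed point can achieve a smaller value
for _ in range(2000):
    u = rule + rng.normal(scale=0.5, size=3)
    assert prox_objective(rule) <= prox_objective(u) + 1e-12
print("rule output minimizes the prox objective at all sampled points")
```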

Convergence Property

We first present a nice property of the proximal gradient descent algorithm. The proof can be found at [7].

Suppose $f$ is convex and $L$-smooth. Take the step size $\alpha=1/L$, and then

$$F(x^{(k+1)})\le F(x^{(k)}),\qquad\|x^{(k+1)}-x^\star\|\le\|x^{(k)}-x^\star\|.$$

31 / 41

Convergence Property [7]

The main convergence theorem: Suppose $f$ is convex and $L$-smooth. Take the step size $\alpha=1/L$, and then

$$F(x^{(k)})-F(x^\star)\le\frac{L\|x^{(0)}-x^\star\|^2}{2k}.$$

32 / 41

Convergence with Strong Convexity [7]

If, in addition, $f$ is strongly convex with parameter $m>0$, then with a fixed step size $\alpha=1/L$, we have

$$\|x^{(k)}-x^\star\|^2\le\left(1-\frac{m}{L}\right)^k\|x^{(0)}-x^\star\|^2.$$

33 / 41

Summary

  • Proximal gradient descent matches the convergence rate of gradient descent.

  • It is useful when $\operatorname{prox}_{\alpha h}$ can be efficiently evaluated.

  • If $f$ is $L$-smooth, then the optimization error, $F(x^{(k)})-F(x^\star)$, decays at the rate $O(1/k)$.

  • If $f$ is $L$-smooth and $m$-strongly-convex, then the optimization error decays exponentially fast at the rate $O(\rho^k)$, where $\rho=1-m/L$.

34 / 41

One More Thing...

35 / 41

Nesterov Acceleration

It turns out that the proximal gradient descent algorithm can be accelerated almost for free, using a somewhat magical technique.

The accelerated version: given an initial value $x^{(0)}=x^{(-1)}$, iterate

$$\begin{aligned}y^{(k+1)}&=x^{(k)}+\frac{k-1}{k+2}\big(x^{(k)}-x^{(k-1)}\big),\\ x^{(k+1)}&=\operatorname{prox}_{\alpha h}\big(y^{(k+1)}-\alpha\nabla f(y^{(k+1)})\big),\quad k=0,1,\ldots\end{aligned}$$

36 / 41
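A sketch of the accelerated iteration on a small synthetic Lasso problem (the data, dimensions, and $\lambda$ are illustrative choices, not from the slides); the only change from plain proximal gradient descent is the momentum extrapolation step:

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Accelerated proximal gradient descent on the Lasso objective
#   F(x) = (1/2)||y - X x||^2 + lam ||x||_1
n, d = 100, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 1.0
alpha = 1.0 / np.linalg.norm(X.T @ X, 2)   # step size 1/L

def F(x):
    return 0.5 * np.sum((y - X @ x)**2) + lam * np.sum(np.abs(x))

x_prev = x = np.zeros(d)                   # x^(0) = x^(-1)
for k in range(200):
    yk = x + (k - 1) / (k + 2) * (x - x_prev)   # momentum extrapolation
    x_prev, x = x, soft_threshold(yk - alpha * X.T @ (X @ yk - y), alpha * lam)

print(F(x))  # objective value near the optimum
```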

Convergence Property [8]

The accelerated convergence rate: Suppose $f$ is convex and $L$-smooth. Take the step size $\alpha=1/L$, and then

$$F(x^{(k)})-F(x^\star)\le\frac{2L\|x^{(0)}-x^\star\|^2}{(k+1)^2}.$$

37 / 41

Convergence with Strong Convexity [8]

If, in addition, $f$ is strongly convex with parameter $m>0$, then with $\kappa=L/m$, the iteration becomes

$$\begin{aligned}y^{(k+1)}&=x^{(k)}+\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\big(x^{(k)}-x^{(k-1)}\big),\\ x^{(k+1)}&=\operatorname{prox}_{\alpha h}\big(y^{(k+1)}-\alpha\nabla f(y^{(k+1)})\big),\quad k=0,1,\ldots\end{aligned}$$

With a fixed step size $\alpha=1/L$, we have

$$F(x^{(k)})-F(x^\star)\le\left(1-\frac{1}{\sqrt{\kappa}}\right)^k\left(F(x^{(0)})-F(x^\star)+\frac{m\|x^{(0)}-x^\star\|^2}{2}\right).$$

38 / 41

Summary

  • (Projected) subgradient descent is a general method to optimize a nonsmooth convex function

  • However, its convergence is typically slow

  • If the nonsmooth part has a simple proximal operator, then the proximal gradient descent algorithm is much preferred

  • Nesterov acceleration also comes almost "for free"

39 / 41

References

[1] Stephen Boyd and Lieven Vandenberghe (2004). Convex optimization. Cambridge University Press.

[2] Robert M. Gower (2018). Convergence theorems for gradient descent. Lecture notes for Statistical Optimization.

[3] Claude Vallee, Danielle Fortune, and Camelia Lerintiu (2008). Subdifferential of the largest eigenvalue of a symmetrical matrix: application of direct projection methods. Analysis and Applications.

[4] Adrian S. Lewis (1999). Nonsmooth analysis of eigenvalues. Mathematical Programming.

[5] https://yuxinchen2020.github.io/ele522_optimization/lectures/subgradient_methods.pdf

[6] Neal Parikh and Stephen Boyd (2014). Proximal algorithms. Foundations and trends® in Optimization.

40 / 41
