
Computational Statistics

Lecture 6

Yixuan Qiu

2022-10-19

1 / 48

Fundamental Problems

  • Numerical Linear Algebra

    • Computations on matrices and vectors
  • Optimization

    • Minimizing objective functions
  • Simulation/Sampling

    • Generating random numbers from statistical distributions
2 / 48

Optimization

3 / 48

Scope

  • Optimization is a very broad topic

  • We can only cover a small subset in this course

  • Focus on convex optimization problems arising from statistical and machine learning models

4 / 48

Scope

  • We will introduce both classical and modern optimization algorithms

  • Also discuss convergence results that are less common in statistics courses (but common in optimization courses)

  • There is no globally "best" optimization algorithm

  • There may be good algorithms for specific problems

5 / 48

Today's Topics

  • Convex functions

  • Gradient descent

  • Projected gradient descent

6 / 48

Concepts

  • Convex set

  • Convex function

  • Strictly convex function

  • Strongly convex function

7 / 48

Convex Set

We only consider subsets of the Euclidean space $\mathbb{R}^d$.

A set $C \subseteq \mathbb{R}^d$ is convex, if

$$tx + (1-t)y \in C, \quad \forall\, x, y \in C,\ 0 \le t \le 1.$$

The line segment connecting $x$ and $y$ is included in $C$.

8 / 48

Convex Function

$f: \mathbb{R}^d \rightarrow \mathbb{R}$ is convex, if its domain $\mathcal{X}$ is a convex set, and

$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y), \quad \forall\, x, y \in \mathcal{X},\ 0 \le t \le 1.$$

The line segment between $x$ and $y$ lies above the function.

9 / 48

Strictly Convex Function

Similar to the definition of convex function, but

$$f(tx + (1-t)y) < t f(x) + (1-t) f(y), \quad \forall\, x \ne y,\ 0 < t < 1.$$

Interpretation: The line segment between $x$ and $y$ strictly lies above the function, except for the end points $x$ and $y$.

10 / 48

Strongly Convex Function

$f$ is strongly convex with parameter $m > 0$, if $f(x) - (m/2)\|x\|^2$ is convex.

Interpretation: f is at least as convex as a quadratic function.

11 / 48

Relation

Strongly convex $\Rightarrow$ Strictly convex $\Rightarrow$ Convex

12 / 48

Properties

  • If $f$ is differentiable and $\mathcal{X}$, the domain of $f$, is convex, then $f$ is convex if and only if

$$f(y) \ge f(x) + [\nabla f(x)]^\top (y - x), \quad \forall\, x, y \in \mathcal{X}.$$

  • If $f$ is twice differentiable and $\mathcal{X}$ is convex, then $f$ is convex if and only if $\nabla^2 f(x) \succeq 0$ for all $x \in \mathcal{X}$.

  • If $f$ is twice differentiable and $\mathcal{X}$ is convex, then $f$ is strongly convex with parameter $m > 0$ if and only if $\nabla^2 f(x) \succeq mI$ for all $x \in \mathcal{X}$.

13 / 48

Properties

  • The $\alpha$-sublevel set of a convex function $f$, defined as

$$C_\alpha = \{x \in \mathcal{X} : f(x) \le \alpha\},$$

is convex for any $\alpha$.

  • The converse is not true! Example: $f(x) = -e^x$ is not convex, but all of its sublevel sets are convex.
14 / 48

Properties

  • A nonnegative weighted sum of convex functions, $f = w_1 f_1 + \cdots + w_m f_m$, preserves convexity.

  • Composition with an affine mapping, $g(x) = f(Ax + b)$, preserves convexity.

  • Pointwise maximum and supremum, e.g., $f(x) = \max\{f_1(x), \ldots, f_m(x)\}$, preserve convexity.

15 / 48

Jensen's Inequality

If $f$ is convex and $X$ is a random variable supported on $C$, then

$$f(\mathbb{E}(X)) \le \mathbb{E}(f(X)).$$
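A quick Monte Carlo sanity check in Python (the choice $f(x) = e^x$ and the normal distribution are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=100_000)  # samples of a random variable X

# Jensen's inequality with the convex function f(x) = e^x:
# f(E(X)) should not exceed E(f(X))
lhs = np.exp(X.mean())
rhs = np.exp(X).mean()
print(lhs <= rhs)  # True
```

For a standard normal, $\mathbb{E}(e^X) = e^{1/2} \approx 1.65$ while $e^{\mathbb{E}(X)} \approx 1$, so the gap here is substantial.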

16 / 48

Examples

  • Univariate functions

    • $e^{ax}$ for any $a \in \mathbb{R}$
    • $x^a$ for $a \ge 1$ or $a \le 0$, with $\mathcal{X} = \mathbb{R}_+$
    • $-x^a$ for $0 \le a \le 1$, with $\mathcal{X} = \mathbb{R}_+$
    • $|x|^a$ for $a \ge 1$, with $\mathcal{X} = \mathbb{R}$
  • Linear functions: $f(x) = Ax + b$

  • Quadratic functions: $f(x) = (1/2) x^\top A x + b^\top x$ if $A \succeq 0$

  • Norms: $\|x\|$ (any norm, not limited to the Euclidean norm)

  • Log-sum-exp (LSE) function: $\mathrm{LSE}(x_1, \ldots, x_n) = \log(e^{x_1} + \cdots + e^{x_n})$

17 / 48

A Side Note on LSE

  • In some sense LSE can be called the "soft max" function, since

$$\max(x_1, \ldots, x_n) \le \mathrm{LSE}(x_1, \ldots, x_n) \le \max(x_1, \ldots, x_n) + \log(n)$$

  • The widely-used Softmax function might better be called "soft argmax":

$$\mathrm{softmax}(x_1, \ldots, x_n) = \left( \frac{e^{x_1}}{\sum_i e^{x_i}}, \ldots, \frac{e^{x_n}}{\sum_i e^{x_i}} \right)$$

  • It can be verified that Softmax is the gradient of LSE
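These facts are easy to verify numerically. A minimal Python sketch (the max-subtraction stabilization is a standard implementation trick, not part of the slides):

```python
import numpy as np

def lse(x):
    # Numerically stable log-sum-exp: factor out the maximum
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    # "Soft argmax": exponentiate and normalize (stabilized the same way)
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, -0.5, 2.0])

# max(x) <= LSE(x) <= max(x) + log(n)
print(np.max(x) <= lse(x) <= np.max(x) + np.log(len(x)))  # True

# Softmax matches a central-difference approximation of the gradient of LSE
eps = 1e-6
num_grad = np.array([(lse(x + eps * e_i) - lse(x - eps * e_i)) / (2 * eps)
                     for e_i in np.eye(len(x))])
print(np.allclose(num_grad, softmax(x)))  # True
```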
18 / 48

Convex Optimization Problem

A convex optimization problem is concerned with finding some $x^\star$ that attains the infimum $\inf_{x \in C} f(x)$, where $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is a convex function with domain $\mathcal{X}$, and $C \subseteq \mathcal{X}$ is a convex set. Such an $x^\star$, if it exists, is called a solution to the convex optimization problem.

In general, a convex optimization problem may have zero, one, or many solutions.

19 / 48

Concepts

  • The optimal value is $\inf_{x \in C} f(x)$

  • A solution $x^\star$ is also called an optimal point

  • The set of all optimal points is called the optimal set

20 / 48

Properties

  • The optimal set, if nonempty, is convex

  • If the objective function $f$ is strictly convex, then the problem has at most one optimal point

  • If $f$ is differentiable, then $x^\star$ is an optimal point if and only if $x^\star \in C$ and $[\nabla f(x^\star)]^\top (y - x^\star) \ge 0,\ \forall\, y \in C$.

  • If $C = \mathcal{X}$, then $x^\star$ is an optimal point if and only if $\nabla f(x^\star) = 0$.

21 / 48

Gradient Descent

22 / 48

Gradient Descent

Gradient descent (GD) is a well-known and widely-used technique to solve the unconstrained, smooth convex optimization problem

$$\min_x\ f(x),$$

where $f$ is convex and differentiable with domain $\mathcal{X} = \mathbb{R}^d$.

The algorithm is simple: given an initial value $x^{(0)}$, iterate

$$x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)}), \quad k = 0, 1, \ldots,$$

where $\alpha_k$ is the step size.
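The iteration is only a few lines of code. A minimal Python sketch on an illustrative quadratic problem (the matrix $A$, vector $b$, and the fixed step size $1/L$ are assumptions made for demonstration):

```python
import numpy as np

def gd(grad, x0, alpha, n_iter=500):
    # Plain gradient descent with a fixed step size alpha
    x = x0.copy()
    for _ in range(n_iter):
        x = x - alpha * grad(x)
    return x

# Toy problem: f(x) = (1/2) x'Ax + b'x with A positive definite,
# so the unique minimizer solves Ax* = -b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x + b

L = np.linalg.eigvalsh(A).max()   # Lipschitz constant of the gradient
x_hat = gd(grad, np.zeros(2), alpha=1.0 / L)
print(np.allclose(x_hat, np.linalg.solve(A, -b)))  # True
```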

23 / 48

Questions

  • We are familiar with using GD to solve statistical models

  • But in this course we need to understand GD better

  • Does GD always converge?

  • How fast does it converge?

  • How to pick αk?

24 / 48

Lipschitz Continuity

We first introduce the concept of Lipschitz continuity, which plays an important role in analyzing GD.

A function $f: \mathbb{R}^d \rightarrow \mathbb{R}^r$ is Lipschitz continuous with constant $L > 0$, if $\|f(x) - f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$.

For example, linear functions are Lipschitz continuous, whereas quadratic functions are not.

25 / 48

Convergence of GD

Suppose that $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is convex and differentiable, and assume its gradient is Lipschitz continuous with constant $L > 0$, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$. Then using a fixed step size $\alpha_k \equiv \alpha \le 1/L$, we have

$$f(x^{(k)}) - f(x^\star) \le \frac{\|x^{(0)} - x^\star\|^2}{2 \alpha k},$$

where $f(x^\star)$ is the optimal value.

This is a non-asymptotic result.

26 / 48


Proof

Step 1: We show that if $\nabla f(x)$ is Lipschitz continuous with constant $L > 0$, then

$$f(y) \le f(x) + [\nabla f(x)]^\top (y - x) + \frac{L}{2} \|y - x\|^2, \quad \forall\, x, y.$$

Step 2: Plugging in $y = x - \alpha \nabla f(x)$, we get

$$f(y) \le f(x) - \alpha \|\nabla f(x)\|^2 + \frac{\alpha^2 L}{2} \|\nabla f(x)\|^2 = f(x) - (1 - \alpha L / 2)\, \alpha \|\nabla f(x)\|^2.$$

This means that if $\alpha \le 1/L$, we have $f(y) \le f(x) - (\alpha/2) \|\nabla f(x)\|^2$, implying that $f$ is nonincreasing along iterations.

27 / 48


Proof

Step 3: Since $f$ is convex, we have

$$f(x^\star) \ge f(x) + [\nabla f(x)]^\top (x^\star - x), \quad \text{i.e.,} \quad f(x) \le f(x^\star) + [\nabla f(x)]^\top (x - x^\star).$$

Step 4: Combining Step 2 and Step 3,

$$\begin{aligned}
f(y) &\le f(x^\star) + [\nabla f(x)]^\top (x - x^\star) - \frac{\alpha}{2} \|\nabla f(x)\|^2, \\
f(y) - f(x^\star) &\le \frac{1}{2\alpha} \left( 2\alpha\, [\nabla f(x)]^\top (x - x^\star) - \alpha^2 \|\nabla f(x)\|^2 \right) \\
&= \frac{1}{2\alpha} \left( -\|x - x^\star - \alpha \nabla f(x)\|^2 + \|x - x^\star\|^2 \right) \\
&= \frac{1}{2\alpha} \left( -\|y - x^\star\|^2 + \|x - x^\star\|^2 \right).
\end{aligned}$$

28 / 48


Proof

Step 5: This means that

$$f(x^{(k+1)}) - f(x^\star) \le \frac{1}{2\alpha} \left( \|x^{(k)} - x^\star\|^2 - \|x^{(k+1)} - x^\star\|^2 \right),$$

and summing over iterations gives a telescoping sum:

$$\sum_{i=0}^{k-1} \left[ f(x^{(i+1)}) - f(x^\star) \right] \le \frac{1}{2\alpha} \left( \|x^{(0)} - x^\star\|^2 - \|x^{(k)} - x^\star\|^2 \right) \le \frac{\|x^{(0)} - x^\star\|^2}{2\alpha}.$$

Step 6: Step 2 shows that $f(x^{(k)})$ is nonincreasing, so $\sum_{i=0}^{k-1} [f(x^{(i+1)}) - f(x^\star)] \ge k\, [f(x^{(k)}) - f(x^\star)]$, and then we get the final result.

29 / 48


Convergence with Strong Convexity [2]

If in addition, $f$ is strongly convex with parameter $m > 0$, then with a fixed step size $\alpha \le 1/L$, we have

$$\|x^{(k)} - x^\star\|^2 \le (1 - m\alpha)^k \|x^{(0)} - x^\star\|^2.$$

Note: Here the convergence of $x^{(k)} \rightarrow x^\star$ implies the convergence of $f(x^{(k)}) \rightarrow f(x^\star)$. (Why?)

$$f(x^{(k)}) \le f(x^\star) + [\nabla f(x^\star)]^\top (x^{(k)} - x^\star) + \frac{L}{2} \|x^{(k)} - x^\star\|^2 = f(x^\star) + \frac{L}{2} \|x^{(k)} - x^\star\|^2,$$

using $\nabla f(x^\star) = 0$ in the unconstrained problem.

30 / 48

Summary

  • If $f$ is $L$-smooth (i.e., $\nabla f(x)$ is Lipschitz continuous with constant $L > 0$), then the optimization error, $f(x^{(k)}) - f(x^\star)$, decays at the rate of $O(1/k)$.

  • If $f$ is $L$-smooth and $m$-strongly-convex, then the optimization error decays exponentially fast at the rate of $O(\rho^k)$ for some $\rho \in (0, 1)$.

31 / 48

Line Search

The theory assumes that we use a fixed step size $\alpha_k \equiv \alpha$. In practice, the step size can be selected adaptively using the line search scheme: at each iteration $k$, find the largest possible $\alpha_k \le 1$ such that

$$f(x^{(k)} - \alpha_k \nabla f(x^{(k)})) \le f(x^{(k)}) - \beta \alpha_k \|\nabla f(x^{(k)})\|^2,$$

where $0 < \beta \le 1/2$.
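A minimal Python sketch of this scheme, approximating the "largest possible $\alpha_k$" by shrinking $\alpha$ geometrically from 1 until the condition holds (the quadratic test problem, $\beta = 0.3$, and the shrink factor 0.5 are illustrative choices):

```python
import numpy as np

def backtracking_step(f, g, x, beta=0.3, shrink=0.5):
    # Shrink alpha geometrically from 1 until the sufficient decrease
    # condition f(x - alpha*g) <= f(x) - beta*alpha*||g||^2 holds
    alpha, fx, gg = 1.0, f(x), g @ g
    while f(x - alpha * g) > fx - beta * alpha * gg:
        alpha *= shrink
    return alpha

# Illustrative quadratic: f(x) = (1/2) x'Ax + b'x, minimized at x* = -A^{-1}b
A = np.array([[10.0, 0.0], [0.0, 1.0]])
b = np.array([-1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b

x = np.zeros(2)
for _ in range(500):
    g = grad(x)
    x = x - backtracking_step(f, g, x) * g
print(np.allclose(x, np.linalg.solve(A, -b)))  # True
```

No knowledge of $L$ is needed here: the backtracking loop discovers an acceptable step size at every iteration.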

32 / 48

Constrained Optimization

  • Recall that GD solves an unconstrained and smooth convex optimization problem

  • However, many problems do not satisfy one or both of these conditions

  • We first consider the extension to constrained problems

33 / 48

Projected Gradient Descent

34 / 48

Constrained Optimization

Consider the constrained optimization problem $\min_{x \in C}\ f(x)$, where $C$ is a closed and non-empty convex set.

Now we assume that we can compute the projection operator $P_C(x)$ easily.

35 / 48

Projection Operator

The projection operator itself is also an optimization problem:

$$P_C(x) = \operatorname*{arg\,min}_{u \in C}\ \frac{1}{2} \|u - x\|^2.$$

However, for some convex sets C, computing PC(x) is trivial.

Projection is the closest point that satisfies the constraints.

36 / 48

Examples

  • If $C = \{x \in \mathbb{R}^d : a^\top x = b\}$, then $P_C(x) = x + (b - a^\top x)\, a / \|a\|^2$.

  • If $C = \{x \in \mathbb{R}^d : a^\top x \le b\}$, then

$$P_C(x) = \begin{cases} x + (b - a^\top x)\, a / \|a\|^2, & a^\top x > b \\ x, & a^\top x \le b \end{cases}.$$

37 / 48

Examples

  • If $C = \{x \in \mathbb{R}^d : l_i \le x_i \le u_i,\ i = 1, \ldots, d\}$, then

$$[P_C(x)]_i = \begin{cases} l_i, & x_i \le l_i \\ x_i, & l_i < x_i \le u_i \\ u_i, & x_i > u_i \end{cases}.$$

  • Special case: $C = \{x \in \mathbb{R}^d : x_i \ge 0\}$, $P_C(x) = [x]_+$.

  • Special case: $C = \{x \in \mathbb{R}^d : \|x\|_\infty \le t\}$,

$$[P_C(x)]_i = \begin{cases} x_i, & |x_i| \le t \\ \operatorname{sign}(x_i)\, t, & |x_i| > t \end{cases}.$$
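These coordinate-wise projections are one-liners in Python (a sketch of the formulas on this slide; the test vector is made up):

```python
import numpy as np

def proj_box(x, l, u):
    # Projection onto {x : l_i <= x_i <= u_i}, applied coordinate-wise
    return np.minimum(np.maximum(x, l), u)

def proj_nonneg(x):
    # Special case l = 0, u = +inf: P_C(x) = [x]_+
    return np.maximum(x, 0.0)

def proj_linf_ball(x, t):
    # Special case l = -t, u = t: the l-infinity ball of radius t
    return np.clip(x, -t, t)

x = np.array([-3.0, 0.5, 4.0])
print(proj_box(x, np.full(3, -1.0), np.full(3, 2.0)))  # clipped to [-1, 2]
print(proj_nonneg(x))                                  # negatives set to 0
print(proj_linf_ball(x, 1.0))                          # clipped to [-1, 1]
```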

38 / 48



Examples

  • If $C = \{x \in \mathbb{R}^d : \|x\| \le t\}$, then

$$P_C(x) = \begin{cases} x, & \|x\| \le t \\ t x / \|x\|, & \|x\| > t \end{cases}.$$

  • Question: How to compute $P_C(x)$, where $C = \{x \in \mathbb{R}^d : \|x\|_1 \le t\}$? Is it easy or hard?

  • Question: How to compute $P_C(X)$, where $C = \{X \in \mathbb{R}^{d \times d} : X \succeq 0\}$, assuming $X$ is symmetric? Is it easy or hard? (Use the Frobenius norm in the definition)

39 / 48

Non-expansive Property of Projection [3]

An interesting property of the projection operator is that it is non-expansive.

Suppose that $C$ is a closed and non-empty convex set. Then for any $x \in C$ and $y \in \mathbb{R}^d$,

$$(P_C(y) - x)^\top (P_C(y) - y) \le 0.$$

This implies that $\|x - P_C(y)\| \le \|x - y\|$.
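The property is easy to probe numerically. A Python sketch using the box $[-1, 1]^d$ as an illustrative convex set (the sampling scheme is an assumption for demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
proj = lambda y: np.clip(y, -1.0, 1.0)   # projection onto the box [-1, 1]^d

# Check ||x - P_C(y)|| <= ||x - y|| for random x in C and arbitrary y
ok = True
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0, size=5)   # a point inside C
    y = rng.normal(scale=3.0, size=5)    # an arbitrary point
    ok &= np.linalg.norm(x - proj(y)) <= np.linalg.norm(x - y) + 1e-12
print(ok)  # True
```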

40 / 48

Projected Gradient Descent

Projected gradient descent (PGD) modifies GD by adding a projection step at each iteration:

$$x^{(k+1)} = P_C\left( x^{(k)} - \alpha_k \nabla f(x^{(k)}) \right), \quad k = 0, 1, \ldots.$$

Computational cost almost identical to GD if the projection operator is cheap.
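A minimal Python sketch of PGD, applied to an illustrative nonnegative least squares problem $\min_{x \ge 0} \frac{1}{2}\|Ax - b\|^2$ (the problem and random data are assumptions made for demonstration):

```python
import numpy as np

def pgd(grad, proj, x0, alpha, n_iter=1000):
    # Projected gradient descent: a GD step followed by projection onto C
    x = x0.copy()
    for _ in range(n_iter):
        x = proj(x - alpha * grad(x))
    return x

# Toy constrained problem: nonnegative least squares,
# where P_C is coordinate-wise clipping at zero
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

grad = lambda x: A.T @ (A @ x - b)
proj = lambda x: np.maximum(x, 0.0)      # projection onto {x : x >= 0}
L = np.linalg.eigvalsh(A.T @ A).max()    # smoothness constant of the gradient

x_hat = pgd(grad, proj, np.zeros(5), alpha=1.0 / L)
print(np.all(x_hat >= 0))  # True: every iterate is feasible
```

At convergence $x^\star$ is a fixed point of the update, i.e., $x^\star = P_C(x^\star - \alpha \nabla f(x^\star))$, which can serve as a stopping criterion.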

41 / 48

Convergence Property [3]

Suppose that $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is convex and $L$-smooth on $C$. Then using a fixed step size $\alpha_k \equiv \alpha = 1/L$, we have

$$f(x^{(k)}) - f(x^\star) \le \frac{3L \|x^{(0)} - x^\star\|^2 + f(x^{(0)}) - f(x^\star)}{k + 1},$$

where $f(x^\star)$ is the optimal value.

42 / 48

Proof

Although the convergence result is similar to that of GD, the proof here is much harder.

We only show the proof of a key intermediate result:

$$f(x^{(k+1)}) \le f(x^{(k)}) - \frac{L}{2} \|x^{(k+1)} - x^{(k)}\|^2.$$

A complete proof can be found at https://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/20/material/lecture09.pdf

43 / 48


Proof

Step 1: Same as GD:

$$f(z) \le f(x) + [\nabla f(x)]^\top (z - x) + \frac{L}{2} \|z - x\|^2, \quad \forall\, x, z \in C.$$

Step 2: Let $y = x - \alpha \nabla f(x)$ and $z = P_C(y)$. Then $\nabla f(x) = (1/\alpha)(x - y)$, and

$$f(z) \le f(x) + \alpha^{-1} (x - y)^\top (z - x) + \frac{L}{2} \|z - x\|^2.$$

Also note that

$$(x - y)^\top (z - x) = (z - y)^\top (z - x) + (x - z)^\top (z - x) = (z - y)^\top (z - x) - \|x - z\|^2.$$

44 / 48

Proof

Step 3: Since $z = P_C(y)$, by the non-expansive property we have $(z - y)^\top (z - x) \le 0$ for $x \in C$, and hence

$$f(z) \le f(x) - \alpha^{-1} \|x - z\|^2 + \frac{L}{2} \|z - x\|^2.$$

Take $x = x^{(k)}$, $z = x^{(k+1)}$, $\alpha = 1/L$, and we have

$$f(x^{(k+1)}) \le f(x^{(k)}) - \frac{L}{2} \|x^{(k+1)} - x^{(k)}\|^2.$$

This shows that $f(x^{(k)})$ is nonincreasing.

45 / 48

Convergence with Strong Convexity [2]

If in addition, $f$ is strongly convex with parameter $m > 0$ on $C$, then with a fixed step size $\alpha = 1/L$, we have

$$f(x^{(k)}) - f(x^\star) \le \left( 1 - \frac{m}{L} \right)^k \left[ f(x^{(0)}) - f(x^\star) \right].$$

46 / 48

Summary

  • PGD has the same convergence rates as GD in the following two cases.

  • If $f$ is $L$-smooth, then the optimization error, $f(x^{(k)}) - f(x^\star)$, decays at the rate of $O(1/k)$.

  • If $f$ is $L$-smooth and $m$-strongly-convex, then the optimization error decays exponentially fast at the rate of $O(\rho^k)$, where $\rho = 1 - m/L$.

47 / 48

References

[1] Stephen Boyd and Lieven Vandenberghe (2004). Convex optimization. Cambridge University Press.

[2] Robert M. Gower (2018). Convergence theorems for gradient descent. Lecture notes for Statistical Optimization.

[3] https://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/20/material/lecture09.pdf

48 / 48
