Numerical Linear Algebra
Optimization
Simulation/Sampling
Optimization is a very broad topic
We can only cover a small subset in this course
Focus on convex optimization problems arising from statistical and machine learning models
We will introduce both classical and modern optimization algorithms
Also discuss convergence results that are less common in statistics courses (but common in optimization courses)
There is no globally "best" optimization algorithm
There may be good algorithms for specific problems
Convex functions
Gradient descent
Projected gradient descent
Convex set
Convex function
Strictly convex function
Strongly convex function
We only consider subsets of the Euclidean space Rd.
A set C⊆Rd is convex, if
tx+(1−t)y∈C,∀x,y∈C, 0≤t≤1.
The line segment connecting x and y is included in C.
f:Rd→R is convex, if its domain X is a convex set, and
f(tx+(1−t)y)≤tf(x)+(1−t)f(y),∀x,y∈X, 0≤t≤1.
The line segment between (x,f(x)) and (y,f(y)) lies above the graph of the function.
Similar to the definition of convex function, but
f(tx+(1−t)y)<tf(x)+(1−t)f(y),∀x≠y, 0<t<1.
Interpretation: The line segment between (x,f(x)) and (y,f(y)) lies strictly above the graph, except at the endpoints.
f is strongly convex with parameter m>0, if f(x)−(m/2)∥x∥² is convex.
Interpretation: f is at least as convex as a quadratic function.
Strongly convex ⇒ Strictly convex ⇒ Convex
First-order condition: if f is differentiable and its domain X is convex, then f is convex if and only if f(y)≥f(x)+[∇f(x)]′(y−x), ∀x,y∈X.
If f is twice differentiable and X is convex, then f is convex if and only if ∇²f(x)⪰0 for all x∈X.
If f is twice differentiable and X is convex, then f is strongly convex with parameter m>0 if and only if ∇²f(x)⪰mI for all x∈X.
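As a quick numerical illustration, a minimal sketch (assuming NumPy; the random matrix A is a made-up example): for the quadratic f(x)=(1/2)x′Ax+b′x the Hessian is the constant matrix A, so the second-order conditions reduce to eigenvalue checks.

```python
import numpy as np

# Hypothetical example: f(x) = (1/2) x'Ax + b'x has constant Hessian A.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.5 * np.eye(5)    # symmetric, positive definite by construction

eigvals = np.linalg.eigvalsh(A)  # eigenvalues of the symmetric Hessian
print("convex:", eigvals.min() >= 0)                   # Hessian PSD => convex
print("strong convexity parameter m:", eigvals.min())  # Hessian >= mI
```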
If f is convex, then any sublevel set Cα={x∈X:f(x)≤α} is convex.
A nonnegative weighted sum of convex functions f=w1f1+⋯+wmfm preserves convexity.
Composition with an affine mapping, g(x)=f(Ax+b), preserves convexity.
Pointwise maximum and supremum, f(x)=max{f1(x),…,fm(x)}, preserve convexity.
Jensen's inequality: if f is convex and Z is a random variable supported on the domain X of f, then
f(E(Z))≤E(f(Z)).
Univariate functions, e.g., e^x and x² on R
Linear (affine) functions: f(x)=a′x+b
Quadratic functions: f(x)=(1/2)x′Ax+b′x if A⪰0
Norms: ∥x∥ (any norm, not limited to Euclidean norm)
Log-sum-exp (LSE) function: LSE(x1,…,xn)=log(e^x1+⋯+e^xn)
max(x1,…,xn)≤LSE(x1,…,xn)≤max(x1,…,xn)+log(n)
The gradient of LSE is the softmax function: softmax(x1,…,xn)=(e^x1/∑i e^xi, …, e^xn/∑i e^xi)
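The bound above also suggests how to evaluate LSE without overflow: shift out max(x) first. A minimal sketch, assuming NumPy; the helper names log_sum_exp and softmax are our own.

```python
import numpy as np

def log_sum_exp(x):
    # Stable evaluation: LSE(x) = max(x) + log(sum(exp(x - max(x)))).
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    # The gradient of LSE; the same shift keeps it numerically stable.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])  # naive exp(x) would overflow
lse = log_sum_exp(x)
print(np.max(x) <= lse <= np.max(x) + np.log(len(x)))  # the bound above: True
print(softmax(x).sum())                                # 1.0
```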
A convex optimization problem is concerned with finding some x∗ that attains the infimum infx∈Cf(x), where f:Rd→R is a convex function with domain X, and C⊆X is a convex set. Such an x∗, if it exists, is called a solution to the convex optimization problem.
In general, a convex optimization problem may have zero, one, or many solutions.
The optimal value is infx∈Cf(x)
A solution x∗ is also called an optimal point
The set of all optimal points is called the optimal set
The optimal set, if nonempty, is convex
If the objective function f is strictly convex, then the problem has at most one optimal point
If f is differentiable, then x∗ is an optimal point if and only if x∗∈C and [∇f(x∗)]′(y−x∗)≥0,∀y∈C.
If C=X, then x∗ is an optimal point if and only if ∇f(x∗)=0.
Gradient descent (GD) is a well-known and widely-used technique to solve the unconstrained, smooth convex optimization problem
minx f(x),
where f is convex and differentiable with domain X=Rd.
The algorithm is simple: given an initial value x(0), iterate
x(k+1)=x(k)−αk⋅∇f(x(k)),k=0,1,…,
where αk is the step size.
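A minimal sketch of the iteration, assuming NumPy; the quadratic test problem and the step size 1/L are illustrative choices, not part of the algorithm.

```python
import numpy as np

def gradient_descent(grad, x0, alpha, n_iters=1000):
    # Iterate x_{k+1} = x_k - alpha * grad(x_k) with a fixed step size.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - alpha * grad(x)
    return x

# Illustrative problem: f(x) = (1/2) x'Ax - b'x, so grad f(x) = Ax - b
# and the minimizer solves Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
alpha = 1.0 / np.linalg.eigvalsh(A).max()  # alpha = 1/L (see the theory below)
x_hat = gradient_descent(lambda x: A @ x - b, np.zeros(2), alpha)
print(np.allclose(x_hat, np.linalg.solve(A, b)))  # True
```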
We are familiar with using GD to fit statistical models
But in this course we need to understand GD better
Does GD always converge?
How fast does it converge?
How to pick αk?
We first introduce the concept of Lipschitz continuity, which plays an important role in analyzing GD.
A function f:Rd→Rr is Lipschitz continuous with constant L>0, if ∥f(x)−f(y)∥≤L∥x−y∥,∀x,y∈Rd.
For example, linear functions are Lipschitz continuous, whereas quadratic functions are not.
Suppose that f:Rd→R is convex and differentiable, and assume its gradient is Lipschitz continuous with constant L>0, i.e., ∥∇f(x)−∇f(y)∥≤L∥x−y∥, ∀x,y∈Rd. Then using a fixed step size αk≡α≤1/L, we have f(x(k))−f(x∗) ≤ ∥x(0)−x∗∥²/(2αk), where f(x∗) is the optimal value.
This is a non-asymptotic result.
Step 1: We show that if ∇f(x) is Lipschitz continuous with constant L>0, then f(y) ≤ f(x)+[∇f(x)]′(y−x)+(L/2)∥y−x∥², ∀x,y.
Step 2: Plugging in y=x−α⋅∇f(x), we get f(y) ≤ f(x)−α[∇f(x)]′[∇f(x)]+(α²L/2)∥∇f(x)∥² = f(x)−(1−αL/2)α∥∇f(x)∥². This means that if α≤1/L, we have f(y) ≤ f(x)−(α/2)∥∇f(x)∥², implying that f is nonincreasing along iterations.
Step 3: Since f is convex, we have f(x∗) ≥ f(x)+[∇f(x)]′(x∗−x), i.e., f(x) ≤ f(x∗)+[∇f(x)]′(x−x∗).
Step 4: Combining Step 2 and Step 3,
f(y) ≤ f(x∗)+[∇f(x)]′(x−x∗)−(α/2)∥∇f(x)∥²,
f(y)−f(x∗) ≤ (1/(2α))(2α[∇f(x)]′(x−x∗)−α²∥∇f(x)∥²) = (1/(2α))(∥x−x∗∥²−∥x−x∗−α∇f(x)∥²) = (1/(2α))(∥x−x∗∥²−∥y−x∗∥²).
Step 5: This means that f(x(k+1))−f(x∗) ≤ (1/(2α))(∥x(k)−x∗∥²−∥x(k+1)−x∗∥²). Summing over iterations,
∑_{i=0}^{k−1} [f(x(i+1))−f(x∗)] ≤ (1/(2α))(∥x(0)−x∗∥²−∥x(k)−x∗∥²) ≤ ∥x(0)−x∗∥²/(2α).
Step 6: Since Step 2 shows that f(x(i)) is nonincreasing, ∑_{i=0}^{k−1} [f(x(i+1))−f(x∗)] ≥ k[f(x(k))−f(x∗)], and dividing by k gives the final result.
If in addition, f is strongly convex with parameter m>0, then with a fixed step size α≤1/L, we have ∥x(k)−x∗∥² ≤ (1−mα)^k ∥x(0)−x∗∥².
Note: Here the convergence of ∥x(k)−x∗∥ implies the convergence of f(x(k))−f(x∗). (Why?)
Answer: apply the Step 1 inequality at x=x∗, where ∇f(x∗)=0: f(x(k)) ≤ f(x∗)+[∇f(x∗)]′(x(k)−x∗)+(L/2)∥x(k)−x∗∥² = f(x∗)+(L/2)∥x(k)−x∗∥².
If f is L-smooth (i.e., ∇f(x) is Lipschitz continuous with constant L>0), then the optimization error, f(x(k))−f(x∗), decays at the rate of O(1/k).
If f is L-smooth and m-strongly-convex, then the optimization error decays exponentially fast at the rate of O(ρ^k) for some ρ∈(0,1).
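These rates are easy to check empirically. A sketch, assuming NumPy, that runs GD on a strongly convex quadratic (a made-up test problem) and verifies the per-iteration contraction (1−mα) from the strong convexity result above:

```python
import numpy as np

# GD on f(x) = (1/2) x'Ax with A = diag(10, 1): L = 10, m = 1, x* = 0.
# Theory: ||x(k+1) - x*||^2 <= (1 - m*alpha) * ||x(k) - x*||^2.
L, m = 10.0, 1.0
A = np.diag([L, m])
alpha = 1.0 / L

x = np.array([1.0, 1.0])
for k in range(5):
    err_old = np.sum(x ** 2)
    x = x - alpha * (A @ x)
    print(k, np.sum(x ** 2) / err_old <= 1 - m * alpha)  # True at every step
```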
The theory assumes that we use a fixed step size αk≡α. In practice, the step size can be selected adaptively using the line search scheme: at each iteration k, find the largest possible αk≤1 such that f(x(k)−αk∇f(x(k))) ≤ f(x(k))−βαk∥∇f(x(k))∥², where 0<β≤1/2.
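This is a backtracking scheme: start from a large step and shrink until the condition holds. A minimal sketch, assuming NumPy; the halving factor 0.5, β=0.3, and the quadratic test problem are illustrative choices.

```python
import numpy as np

def line_search_step(f, x, g, beta=0.3, alpha0=1.0, shrink=0.5):
    # Find the largest alpha <= alpha0 on a geometric grid satisfying
    # f(x - alpha*g) <= f(x) - beta*alpha*||g||^2, where g = grad f(x).
    alpha, fx, g2 = alpha0, f(x), np.sum(g ** 2)
    while f(x - alpha * g) > fx - beta * alpha * g2:
        alpha *= shrink  # terminates once alpha <= 1/L, since beta <= 1/2
    return alpha

# Usage inside a GD loop (illustrative quadratic):
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
x = np.array([1.0, 1.0])
for _ in range(100):
    g = A @ x
    x = x - line_search_step(f, x, g) * g
print(np.allclose(x, np.zeros(2), atol=1e-6))  # True
```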
Recall that GD solves an unconstrained and smooth convex optimization problem
However, many problems fail to satisfy one or both of these conditions
We first consider the extension to constrained problems
Consider the constrained optimization problem minx∈C f(x), where C is a closed and non-empty convex set.
Now we assume that we can compute the projection operator PC(x) easily.
The projection operator itself is also an optimization problem:
PC(x)=argmin_{u∈C} (1/2)∥u−x∥².
However, for some convex sets C, computing PC(x) is trivial.
Projection is the closest point that satisfies the constraints.
If C={x∈Rd:a′x=b}, then PC(x)=x+(b−a′x)a/∥a∥².
If C={x∈Rd:a′x≤b}, then PC(x)=x+(b−a′x)a/∥a∥² if a′x>b, and PC(x)=x if a′x≤b.
If C={x∈Rd:li≤xi≤ui, i=1,…,d}, then [PC(x)]i = li if xi≤li; xi if li<xi≤ui; ui if xi>ui.
Special case: C={x∈Rd:xi≥0}, PC(x)=[x]+.
Special case: C={x∈Rd:∥x∥∞≤t}, [PC(x)]i = xi if |xi|≤t; sign(xi)·t if |xi|>t.
If C={x∈Rd:∥x∥≤t} (Euclidean ball), then PC(x)=x if ∥x∥≤t, and PC(x)=t·x/∥x∥ if ∥x∥>t. (These closed-form cases are collected in a code sketch after the questions below.)
Question: How to compute PC(x), where C={x∈Rd:∥x∥1≤t}? Is it easy or hard?
Question: How to compute PC(X), where C={X∈R^{d×d}:X⪰0}, assuming X is symmetric? Is it easy or hard? (Use the Frobenius norm in the definition.)
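The closed-form cases above translate directly into code; the two questions are left open. A sketch, assuming NumPy; the function names are our own.

```python
import numpy as np

def proj_hyperplane(x, a, b):
    # C = {x : a'x = b}
    return x + (b - a @ x) * a / np.sum(a ** 2)

def proj_halfspace(x, a, b):
    # C = {x : a'x <= b}: move only if the constraint is violated.
    return x if a @ x <= b else proj_hyperplane(x, a, b)

def proj_box(x, lo, hi):
    # C = {x : lo_i <= x_i <= hi_i}: clip coordinatewise.
    return np.clip(x, lo, hi)

def proj_l2_ball(x, t):
    # C = {x : ||x||_2 <= t}: rescale onto the sphere if outside.
    nrm = np.linalg.norm(x)
    return x if nrm <= t else t * x / nrm

x = np.array([2.0, -3.0])
print(proj_box(x, -1.0, 1.0))  # [ 1. -1.]
print(proj_l2_ball(x, 1.0))    # x / ||x||
```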
An interesting property of the projection operator is that it is non-expansive.
Suppose that C is a closed and non-empty convex set, then for any x∈C and y∈Rd, (PC(y)−x)′(PC(y)−y)≤0. This implies that ∥x−PC(y)∥≤∥x−y∥.
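A quick numerical check of this property, assuming NumPy and reusing the Euclidean-ball projection from the sketch above (the test points are random, made-up data):

```python
import numpy as np

def proj_l2_ball(x, t):
    nrm = np.linalg.norm(x)
    return x if nrm <= t else t * x / nrm

rng = np.random.default_rng(2)
t = 1.0
x = proj_l2_ball(rng.standard_normal(3), t)  # some point x in C
y = rng.standard_normal(3)
p = proj_l2_ball(y, t)                       # P_C(y)
print((p - x) @ (p - y) <= 0)                           # obtuse-angle inequality: True
print(np.linalg.norm(x - p) <= np.linalg.norm(x - y))   # non-expansive: True
```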
Projected gradient descent (PGD) modifies GD by adding a projection step at each iteration:
x(k+1)=PC(x(k)−αk⋅∇f(x(k))),k=0,1,….
Computational cost almost identical to GD if the projection operator is cheap.
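A minimal sketch of PGD, assuming NumPy and reusing the nonnegativity projection [x]+ from the examples above; the nonnegative least-squares test problem is an illustrative choice.

```python
import numpy as np

def projected_gradient_descent(grad, proj, x0, alpha, n_iters=2000):
    # Iterate x_{k+1} = P_C(x_k - alpha * grad(x_k)).
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = proj(x - alpha * grad(x))
    return x

# Illustrative problem: nonnegative least squares,
#   min_{x >= 0} (1/2)||Ax - b||^2,  grad = A'(Ax - b),  L = max eig of A'A.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
L = np.linalg.eigvalsh(A.T @ A).max()
x_hat = projected_gradient_descent(
    grad=lambda x: A.T @ (A @ x - b),
    proj=lambda x: np.maximum(x, 0.0),  # P_C(x) = [x]_+
    x0=np.zeros(5),
    alpha=1.0 / L,
)
print(x_hat.min() >= 0)  # iterates stay feasible: True
```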
Suppose that f:Rd→R is convex and L-smooth on C. Then using a fixed step size αk≡α=1/L, we have f(x(k))−f(x∗) ≤ (3L∥x(0)−x∗∥²+f(x(0))−f(x∗))/(k+1), where f(x∗) is the optimal value.
Although the convergence result is similar to that of GD, the proof here is much harder.
We only show the proof of a key intermediate result, f(x(k+1)) ≤ f(x(k))−(L/2)∥x(k+1)−x(k)∥².
A complete proof can be found at https://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/20/material/lecture09.pdf
Step 1: Same as GD. f(z) ≤ f(x)+[∇f(x)]′(z−x)+(L/2)∥z−x∥², ∀x,z∈C.
Step 2: Let y=x−α⋅∇f(x) and z=PC(y); then ∇f(x)=(1/α)(x−y), and f(z) ≤ f(x)+(1/α)(x−y)′(z−x)+(L/2)∥z−x∥². Also note that (x−y)′(z−x) = (z−y)′(z−x)+(x−z)′(z−x) = (z−y)′(z−x)−∥x−z∥².
Step 3: Since z=PC(y), by the non-expansive property we have (z−x)′(z−y)≤0, and hence f(z) ≤ f(x)−(1/α)∥x−z∥²+(L/2)∥z−x∥².
Step 4: Take x=x(k), z=x(k+1), α=1/L, and we have f(x(k+1)) ≤ f(x(k))−(L/2)∥x(k+1)−x(k)∥². This shows that f(x(k)) is nonincreasing.
If in addition, f is strongly convex with parameter m>0 on C, then with a fixed step size α=1/L, we have f(x(k))−f(x∗) ≤ (1−m/L)^k [f(x(0))−f(x∗)].
PGD has the same convergence rates as GD in the following two cases.
If f is L-smooth, then the optimization error, f(x(k))−f(x∗), decays at the rate of O(1/k).
If f is L-smooth and m-strongly-convex, then the optimization error decays exponentially fast at the rate of O(ρ^k), where ρ=1−m/L.
[1] Stephen Boyd and Lieven Vandenberghe (2004). Convex optimization. Cambridge University Press.
[2] Robert M. Gower (2018). Convergence theorems for gradient descent. Lecture notes for Statistical Optimization.
[3] https://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/20/material/lecture09.pdf