Numerical Linear Algebra
Optimization
Simulation/Sampling
Optimization is a very broad topic
We can only cover a small subset in this course
Focus on convex optimization problems arising from statistical and machine learning models
We will introduce both classical and modern optimization algorithms
Also discuss convergence results that are less common in statistics courses (but common in optimization courses)
There is no globally "best" optimization algorithm
There may be good algorithms for specific problems
Convex functions
Gradient descent
Projected gradient descent
Convex set
Convex function
Strictly convex function
Strongly convex function
We only consider subsets of the Euclidean space Rd.
A set C⊆Rd is convex, if
tx+(1−t)y∈C,∀x,y∈C, 0≤t≤1.
The line segment connecting x and y is included in C.
f:Rd→R is convex, if its domain X is a convex set, and
f(tx+(1−t)y)≤tf(x)+(1−t)f(y),∀x,y∈X, 0≤t≤1.
The line segment between (x,f(x)) and (y,f(y)) lies above the graph of the function.
Similar to the definition of convex function, but
f(tx+(1−t)y)<tf(x)+(1−t)f(y),∀x≠y, 0<t<1.
Interpretation: The line segment between (x,f(x)) and (y,f(y)) lies strictly above the graph, except at the endpoints.
f is strongly convex with parameter m>0, if f(x)−(m/2)∥x∥² is convex.
Interpretation: f is at least as convex as a quadratic function.
Strongly convex ⇒ Strictly convex ⇒ Convex
First-order condition: if f is differentiable and its domain X is convex, then f is convex if and only if f(y)≥f(x)+[∇f(x)]′(y−x), ∀x,y∈X.
If f is twice differentiable and X is convex, then f is convex if and only if ∇²f(x)⪰0 for all x∈X.
If f is twice differentiable and X is convex, then f is strongly convex with parameter m>0 if and only if ∇²f(x)⪰mI for all x∈X.
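As a quick numerical illustration, a minimal sketch (assuming NumPy; the random matrix A is a made-up example): for the quadratic f(x)=(1/2)x′Ax+b′x the Hessian is the constant matrix A, so the second-order conditions reduce to eigenvalue checks.

```python
import numpy as np

# Hypothetical example: f(x) = (1/2) x'Ax + b'x has constant Hessian A.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.5 * np.eye(5)    # symmetric, positive definite by construction

eigvals = np.linalg.eigvalsh(A)  # eigenvalues of the symmetric Hessian
print("convex:", eigvals.min() >= 0)                   # Hessian PSD => convex
print("strong convexity parameter m:", eigvals.min())  # Hessian >= mI
```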
If f is convex, then any sublevel set Cα={x∈X:f(x)≤α} is convex.
A nonnegative weighted sum of convex functions f=w1f1+⋯+wmfm preserves convexity.
Composition with an affine mapping, g(x)=f(Ax+b), preserves convexity.
Pointwise maximum and supremum, f(x)=max{f1(x),…,fm(x)}, preserve convexity.
Jensen's inequality: if f is convex and Z is a random variable supported on the domain X of f, then
f(E(Z))≤E(f(Z)).
Univariate functions, e.g., e^x and x² on R
Linear (affine) functions: f(x)=a′x+b
Quadratic functions: f(x)=(1/2)x′Ax+b′x if A⪰0
Norms: ∥x∥ (any norm, not limited to Euclidean norm)
Log-sum-exp (LSE) function: LSE(x1,…,xn)=log(e^x1+⋯+e^xn)
max(x1,…,xn)≤LSE(x1,…,xn)≤max(x1,…,xn)+log(n)
The gradient of LSE is the softmax function: softmax(x1,…,xn)=(e^x1/∑i e^xi, …, e^xn/∑i e^xi)
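The bound above also suggests how to evaluate LSE without overflow: shift out max(x) first. A minimal sketch, assuming NumPy; the helper names log_sum_exp and softmax are our own.

```python
import numpy as np

def log_sum_exp(x):
    # Stable evaluation: LSE(x) = max(x) + log(sum(exp(x - max(x)))).
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    # The gradient of LSE; the same shift keeps it numerically stable.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])  # naive exp(x) would overflow
lse = log_sum_exp(x)
print(np.max(x) <= lse <= np.max(x) + np.log(len(x)))  # the bound above: True
print(softmax(x).sum())                                # 1.0
```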
A convex optimization problem is concerned with finding some x∗ that attains the infimum infx∈Cf(x), where f:Rd→R is a convex function with domain X, and C⊆X is a convex set. Such an x∗, if it exists, is called a solution to the convex optimization problem.
In general, a convex optimization problem may have zero, one, or many solutions.
The optimal value is infx∈Cf(x)
A solution x∗ is also called an optimal point
The set of all optimal points is called the optimal set
The optimal set, if nonempty, is convex
If the objective function f is strictly convex, then the problem has at most one optimal point
If f is differentiable, then x∗ is an optimal point if and only if x∗∈C and [∇f(x∗)]′(y−x∗)≥0,∀y∈C.
If C=X, then x∗ is an optimal point if and only if ∇f(x∗)=0.
Gradient descent (GD) is a well-known and widely-used technique to solve the unconstrained, smooth convex optimization problem
minx f(x),
where f is convex and differentiable with domain X=Rd.
The algorithm is simple: given an initial value x(0), iterate
x(k+1)=x(k)−αk⋅∇f(x(k)),k=0,1,…,
where αk is the step size.
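A minimal sketch of the iteration, assuming NumPy; the quadratic test problem and the step size 1/L are illustrative choices, not part of the algorithm.

```python
import numpy as np

def gradient_descent(grad, x0, alpha, n_iters=1000):
    # Iterate x_{k+1} = x_k - alpha * grad(x_k) with a fixed step size.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - alpha * grad(x)
    return x

# Illustrative problem: f(x) = (1/2) x'Ax - b'x, so grad f(x) = Ax - b
# and the minimizer solves Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
alpha = 1.0 / np.linalg.eigvalsh(A).max()  # alpha = 1/L (see the theory below)
x_hat = gradient_descent(lambda x: A @ x - b, np.zeros(2), alpha)
print(np.allclose(x_hat, np.linalg.solve(A, b)))  # True
```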
We are familiar with using GD to fit statistical models
But in this course we need to understand GD better
Does GD always converge?
How fast does it converge?
How to pick αk?
We first introduce the concept of Lipschitz continuity, which plays an important role in analyzing GD.
A function f:Rd→Rr is Lipschitz continuous with constant L>0, if ∥f(x)−f(y)∥≤L∥x−y∥,∀x,y∈Rd.
For example, linear functions are Lipschitz continuous, whereas quadratic functions are not.
Suppose that f:Rd→R is convex and differentiable, and assume its gradient is Lipschitz continuous with constant L>0, i.e., ∥∇f(x)−∇f(y)∥≤L∥x−y∥, ∀x,y∈Rd. Then using a fixed step size αk≡α≤1/L, we have f(x(k))−f(x∗) ≤ ∥x(0)−x∗∥²/(2αk), where f(x∗) is the optimal value.
This is a non-asymptotic result.
Step 1: We show that if ∇f(x) is Lipschitz continuous with constant L>0, then f(y) ≤ f(x)+[∇f(x)]′(y−x)+(L/2)∥y−x∥², ∀x,y.
Step 2: Plugging in y=x−α⋅∇f(x), we get f(y) ≤ f(x)−α[∇f(x)]′[∇f(x)]+(α²L/2)∥∇f(x)∥² = f(x)−(1−αL/2)α∥∇f(x)∥². This means that if α≤1/L, we have f(y) ≤ f(x)−(α/2)∥∇f(x)∥², implying that f is nonincreasing along iterations.
Step 3: Since f is convex, we have f(x∗) ≥ f(x)+[∇f(x)]′(x∗−x), i.e., f(x) ≤ f(x∗)+[∇f(x)]′(x−x∗).
Step 4: Combining Step 2 and Step 3,
f(y) ≤ f(x∗)+[∇f(x)]′(x−x∗)−(α/2)∥∇f(x)∥²,
f(y)−f(x∗) ≤ (1/(2α))(2α[∇f(x)]′(x−x∗)−α²∥∇f(x)∥²) = (1/(2α))(∥x−x∗∥²−∥x−x∗−α∇f(x)∥²) = (1/(2α))(∥x−x∗∥²−∥y−x∗∥²).
Step 5: This means that f(x(k+1))−f(x∗) ≤ (1/(2α))(∥x(k)−x∗∥²−∥x(k+1)−x∗∥²). Summing over iterations,
∑_{i=0}^{k−1} [f(x(i+1))−f(x∗)] ≤ (1/(2α))(∥x(0)−x∗∥²−∥x(k)−x∗∥²) ≤ ∥x(0)−x∗∥²/(2α).
Step 6: Since Step 2 shows that f(x(i)) is nonincreasing, ∑_{i=0}^{k−1} [f(x(i+1))−f(x∗)] ≥ k[f(x(k))−f(x∗)], and dividing by k gives the final result.
If in addition, f is strongly convex with parameter m>0, then with a fixed step size α≤1/L, we have ∥x(k)−x∗∥² ≤ (1−mα)^k ∥x(0)−x∗∥².
Note: Here the convergence of ∥x(k)−x∗∥ implies the convergence of f(x(k))−f(x∗). (Why?)
Answer: apply the Step 1 inequality at x=x∗, where ∇f(x∗)=0: f(x(k)) ≤ f(x∗)+[∇f(x∗)]′(x(k)−x∗)+(L/2)∥x(k)−x∗∥² = f(x∗)+(L/2)∥x(k)−x∗∥².
If f is L-smooth (i.e., ∇f(x) is Lipschitz continuous with constant L>0), then the optimization error, f(x(k))−f(x∗), decays at the rate of O(1/k).
If f is L-smooth and m-strongly-convex, then the optimization error decays exponentially fast at the rate of O(ρ^k) for some ρ∈(0,1).
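These rates are easy to check empirically. A sketch, assuming NumPy, that runs GD on a strongly convex quadratic (a made-up test problem) and verifies the per-iteration contraction (1−mα) from the strong convexity result above:

```python
import numpy as np

# GD on f(x) = (1/2) x'Ax with A = diag(10, 1): L = 10, m = 1, x* = 0.
# Theory: ||x(k+1) - x*||^2 <= (1 - m*alpha) * ||x(k) - x*||^2.
L, m = 10.0, 1.0
A = np.diag([L, m])
alpha = 1.0 / L

x = np.array([1.0, 1.0])
for k in range(5):
    err_old = np.sum(x ** 2)
    x = x - alpha * (A @ x)
    print(k, np.sum(x ** 2) / err_old <= 1 - m * alpha)  # True at every step
```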
The theory assumes that we use a fixed step size αk≡α. In practice, the step size can be selected adaptively using the line search scheme: at each iteration k, find the largest possible αk≤1 such that f(x(k)−αk∇f(x(k))) ≤ f(x(k))−βαk∥∇f(x(k))∥², where 0<β≤1/2.
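This is a backtracking scheme: start from a large step and shrink until the condition holds. A minimal sketch, assuming NumPy; the halving factor 0.5, β=0.3, and the quadratic test problem are illustrative choices.

```python
import numpy as np

def line_search_step(f, x, g, beta=0.3, alpha0=1.0, shrink=0.5):
    # Find the largest alpha <= alpha0 on a geometric grid satisfying
    # f(x - alpha*g) <= f(x) - beta*alpha*||g||^2, where g = grad f(x).
    alpha, fx, g2 = alpha0, f(x), np.sum(g ** 2)
    while f(x - alpha * g) > fx - beta * alpha * g2:
        alpha *= shrink  # terminates once alpha <= 1/L, since beta <= 1/2
    return alpha

# Usage inside a GD loop (illustrative quadratic):
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
x = np.array([1.0, 1.0])
for _ in range(100):
    g = A @ x
    x = x - line_search_step(f, x, g) * g
print(np.allclose(x, np.zeros(2), atol=1e-6))  # True
```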
Recall that GD solves an unconstrained and smooth convex optimization problem
However, many problems fail to satisfy one or both of these conditions
We first consider the extension to constrained problems
Consider the constrained optimization problem minx∈C f(x), where C is a closed and non-empty convex set.
Now we assume that we can compute the projection operator PC(x) easily.
The projection operator itself is also an optimization problem:
PC(x)=argmin_{u∈C} (1/2)∥u−x∥².
However, for some convex sets C, computing PC(x) is trivial.
Projection is the closest point that satisfies the constraints.
If C={x∈Rd:a′x=b}, then PC(x)=x+(b−a′x)a/∥a∥².
If C={x∈Rd:a′x≤b}, then PC(x)=x+(b−a′x)a/∥a∥² if a′x>b, and PC(x)=x if a′x≤b.
If C={x∈Rd:li≤xi≤ui, i=1,…,d}, then [PC(x)]i = li if xi≤li; xi if li<xi≤ui; ui if xi>ui.
Special case: C={x∈Rd:xi≥0}, PC(x)=[x]+.
Special case: C={x∈Rd:∥x∥∞≤t}, [PC(x)]i = xi if |xi|≤t; sign(xi)·t if |xi|>t.
If C={x∈Rd:∥x∥≤t} (Euclidean ball), then PC(x)=x if ∥x∥≤t, and PC(x)=t·x/∥x∥ if ∥x∥>t. (These closed-form cases are collected in a code sketch after the questions below.)
Question: How to compute PC(x), where C={x∈Rd:∥x∥1≤t}? Is it easy or hard?
Question: How to compute PC(X), where C={X∈R^{d×d}:X⪰0}, assuming X is symmetric? Is it easy or hard? (Use the Frobenius norm in the definition.)
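The closed-form cases above translate directly into code; the two questions are left open. A sketch, assuming NumPy; the function names are our own.

```python
import numpy as np

def proj_hyperplane(x, a, b):
    # C = {x : a'x = b}
    return x + (b - a @ x) * a / np.sum(a ** 2)

def proj_halfspace(x, a, b):
    # C = {x : a'x <= b}: move only if the constraint is violated.
    return x if a @ x <= b else proj_hyperplane(x, a, b)

def proj_box(x, lo, hi):
    # C = {x : lo_i <= x_i <= hi_i}: clip coordinatewise.
    return np.clip(x, lo, hi)

def proj_l2_ball(x, t):
    # C = {x : ||x||_2 <= t}: rescale onto the sphere if outside.
    nrm = np.linalg.norm(x)
    return x if nrm <= t else t * x / nrm

x = np.array([2.0, -3.0])
print(proj_box(x, -1.0, 1.0))  # [ 1. -1.]
print(proj_l2_ball(x, 1.0))    # x / ||x||
```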
An interesting property of the projection operator is that it is non-expansive.
Suppose that C is a closed and non-empty convex set, then for any x∈C and y∈Rd, (PC(y)−x)′(PC(y)−y)≤0. This implies that ∥x−PC(y)∥≤∥x−y∥.
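A quick numerical check of this property, assuming NumPy and reusing the Euclidean-ball projection from the sketch above (the test points are random, made-up data):

```python
import numpy as np

def proj_l2_ball(x, t):
    nrm = np.linalg.norm(x)
    return x if nrm <= t else t * x / nrm

rng = np.random.default_rng(2)
t = 1.0
x = proj_l2_ball(rng.standard_normal(3), t)  # some point x in C
y = rng.standard_normal(3)
p = proj_l2_ball(y, t)                       # P_C(y)
print((p - x) @ (p - y) <= 0)                           # obtuse-angle inequality: True
print(np.linalg.norm(x - p) <= np.linalg.norm(x - y))   # non-expansive: True
```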
Projected gradient descent (PGD) modifies GD by adding a projection step at each iteration:
x(k+1)=PC(x(k)−αk⋅∇f(x(k))),k=0,1,….
Computational cost almost identical to GD if the projection operator is cheap.
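A minimal sketch of PGD, assuming NumPy and reusing the nonnegativity projection [x]+ from the examples above; the nonnegative least-squares test problem is an illustrative choice.

```python
import numpy as np

def projected_gradient_descent(grad, proj, x0, alpha, n_iters=2000):
    # Iterate x_{k+1} = P_C(x_k - alpha * grad(x_k)).
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = proj(x - alpha * grad(x))
    return x

# Illustrative problem: nonnegative least squares,
#   min_{x >= 0} (1/2)||Ax - b||^2,  grad = A'(Ax - b),  L = max eig of A'A.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
L = np.linalg.eigvalsh(A.T @ A).max()
x_hat = projected_gradient_descent(
    grad=lambda x: A.T @ (A @ x - b),
    proj=lambda x: np.maximum(x, 0.0),  # P_C(x) = [x]_+
    x0=np.zeros(5),
    alpha=1.0 / L,
)
print(x_hat.min() >= 0)  # iterates stay feasible: True
```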
Suppose that f:Rd→R is convex and L-smooth on C. Then using a fixed step size αk≡α=1/L, we have f(x(k))−f(x∗) ≤ (3L∥x(0)−x∗∥²+f(x(0))−f(x∗))/(k+1), where f(x∗) is the optimal value.
Although the convergence result is similar to that of GD, the proof here is much harder.
We only show the proof of a key intermediate result, f(x(k+1)) ≤ f(x(k))−(L/2)∥x(k+1)−x(k)∥².
A complete proof can be found at https://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/20/material/lecture09.pdf
Step 1: Same as GD. f(z) ≤ f(x)+[∇f(x)]′(z−x)+(L/2)∥z−x∥², ∀x,z∈C.
Step 2: Let y=x−α⋅∇f(x) and z=PC(y); then ∇f(x)=(1/α)(x−y), and f(z) ≤ f(x)+(1/α)(x−y)′(z−x)+(L/2)∥z−x∥². Also note that (x−y)′(z−x) = (z−y)′(z−x)+(x−z)′(z−x) = (z−y)′(z−x)−∥x−z∥².
Step 3: Since z=PC(y), by the non-expansive property we have (z−x)′(z−y)≤0, and hence f(z) ≤ f(x)−(1/α)∥x−z∥²+(L/2)∥z−x∥².
Step 4: Take x=x(k), z=x(k+1), α=1/L, and we have f(x(k+1)) ≤ f(x(k))−(L/2)∥x(k+1)−x(k)∥². This shows that f(x(k)) is nonincreasing.
If in addition, f is strongly convex with parameter m>0 on C, then with a fixed step size α=1/L, we have f(x(k))−f(x∗) ≤ (1−m/L)^k [f(x(0))−f(x∗)].
PGD has the same convergence rates as GD in the following two cases.
If f is L-smooth, then the optimization error, f(x(k))−f(x∗), decays at the rate of O(1/k).
If f is L-smooth and m-strongly-convex, then the optimization error decays exponentially fast at the rate of O(ρ^k), where ρ=1−m/L.
[1] Stephen Boyd and Lieven Vandenberghe (2004). Convex optimization. Cambridge University Press.
[2] Robert M. Gower (2018). Convergence theorems for gradient descent. Lecture notes for Statistical Optimization.
[3] https://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/20/material/lecture09.pdf