Subgradient and subdifferential
Subgradient methods
Proximal gradient descent
The simplest gradient descent method deals with unconstrained and smooth problems
Projected gradient descent extends it to constrained problems
Today we discuss how to deal with nonsmooth problems
A subgradient of a convex function f: R^d → R at a point x is any vector g ∈ R^d such that
f(y) ≥ f(x) + g′(y − x), ∀y ∈ X, where X is the domain of f.
Recall that if f is differentiable, then f is convex if and only if
f(y)≥f(x)+[∇f(x)]′(y−x),∀x,y∈X.
The set of all subgradients of f at x is called the subdifferential:
∂f(x) = {g ∈ R^d : g is a subgradient of f at x}.
A subgradient always exists for a convex function: the subdifferential is nonempty at every interior point of the domain
∂f(x) is closed and convex
f is differentiable at x if and only if the subdifferential is a singleton containing the gradient: ∂f(x)={∇f(x)}
Linear combination: ∂(α_1 f_1 + α_2 f_2)(x) = α_1 ∂f_1(x) + α_2 ∂f_2(x), where α_1, α_2 ≥ 0, and for two sets C_1, C_2, αC_1 + βC_2 = {αx_1 + βx_2 : x_1 ∈ C_1, x_2 ∈ C_2}
Affine transformation: If g(x)=f(Ax+b), then ∂g(x)=A′∂f(Ax+b)
Chain rule: let f be convex, and let g be convex, differentiable, and nondecreasing. Then h(x) = g(f(x)) satisfies ∂h(x) = g′(f(x)) ∂f(x)
For f(x) = ∥x∥ (the Euclidean norm):
If x ≠ 0, ∂f(x) = {x/∥x∥}
If x = 0, ∂f(x) = {z : ∥z∥ ≤ 1}
For f(x) = ∥x∥_1:
∂f(x) = {(g_1, …, g_d)′ : g_i ∈ G_i}, where
If x_i ≠ 0, G_i = {sign(x_i)}
If x_i = 0, G_i = [−1, 1] = {z : |z| ≤ 1}
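As a quick sanity check, a subgradient of ∥x∥_1 can be assembled coordinate-wise. A minimal numpy sketch (choosing g_i = 0 at coordinates where x_i = 0, which is one of infinitely many valid picks):

```python
import numpy as np

def l1_subgradient(x):
    """Return one valid subgradient of f(x) = ||x||_1 at x.

    Where x_i != 0 the coordinate must be sign(x_i); where x_i == 0 any
    value in [-1, 1] works, and np.sign conveniently returns 0 there.
    """
    return np.sign(x)

x = np.array([1.5, 0.0, -2.0])
g = l1_subgradient(x)                     # -> [1., 0., -1.]
# verify the subgradient inequality f(y) >= f(x) + g'(y - x) at a few points
for y in [np.zeros(3), np.ones(3), np.array([-1.0, 2.0, 0.5])]:
    assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x)
```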
For f(X) = λ_max(X) defined on all symmetric d×d matrices, let λ_1 ≥ ⋯ ≥ λ_d be the eigenvalues and γ_1, …, γ_d be the associated eigenvectors. Then
∂f(X) = {T ∈ K : (λ_1 − λ_i) γ_i′ T γ_i = 0, i = 2, …, d}, where K = {T ∈ R^{d×d} : T′ = T, T ⪰ 0, tr(T) = 1}.
Special case: if λ_1 > λ_2 (the largest eigenvalue is simple), the constraints force T = γ_1 γ_1′, so f is differentiable at X with ∇f(X) = γ_1 γ_1′.
There is a very general rule: x* is an optimal point of f(x) if and only if 0 is a subgradient of f at x*:
f(x*) = min_x f(x) ⇔ 0 ∈ ∂f(x*).
This is true even for nondifferentiable f.
If f is differentiable, then the condition reduces to ∇f(x∗)=0.
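As a one-dimensional illustration (the function f(x) = |x| + (1/2)(x − b)² is an illustrative choice), the rule 0 ∈ ∂f(x*) pins down the minimizer in closed form:

```python
import numpy as np

def minimize_abs_plus_quadratic(b):
    """Minimize f(x) = |x| + 0.5*(x - b)**2 via the rule 0 in subdiff f(x*).

    For x* != 0 optimality requires sign(x*) + x* - b = 0; for x* = 0 it
    requires |b| <= 1.  Both cases combine into soft-thresholding b at 1.
    """
    return np.sign(b) * max(abs(b) - 1.0, 0.0)

b = 2.0
x_star = minimize_abs_plus_quadratic(b)   # x* = 1.0
# check 0 in subdiff f(x*): here x* > 0, so the subdifferential is
# {sign(x*) + x* - b} = {1 + 1 - 2} = {0}
assert np.isclose(np.sign(x_star) + x_star - b, 0.0)
# brute-force confirmation on a fine grid
grid = np.linspace(-3.0, 3.0, 10001)
vals = np.abs(grid) + 0.5 * (grid - b) ** 2
assert abs(grid[np.argmin(vals)] - x_star) < 1e-3
```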
Recall that gradient descent solves min_x f(x), but it requires f to be differentiable.
Subgradient descent loosens this condition by replacing the gradient with a subgradient: given an initial value x^(0), iterate
x^(k+1) = x^(k) − α_k g^(k), k = 0, 1, …,
where α_k is the step size, and g^(k) ∈ ∂f(x^(k)) is any subgradient of f at x^(k).
If the optimization problem is constrained to a closed and nonempty convex set C, then, as in projected gradient descent, we can use the projected subgradient descent method:
x^(k+1) = P_C(x^(k) − α_k g^(k)), k = 0, 1, ….
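A minimal numpy sketch of projected subgradient descent on the illustrative problem min ∥x − c∥_1 subject to ∥x∥ ≤ 1 with c = (2, 0)′, whose solution is x* = (1, 0)′ with optimal value 1, using the diminishing step size α_k = 1/√(k+1):

```python
import numpy as np

def project_unit_ball(x):
    """Euclidean projection onto {x : ||x|| <= 1}."""
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

def projected_subgradient_descent(c, x0, n_iter=1000):
    """Minimize f(x) = ||x - c||_1 over the unit ball.

    Tracks the best iterate, since subgradient steps are not monotone.
    """
    x = x0.copy()
    best_x, best_f = x.copy(), np.sum(np.abs(x - c))
    for k in range(n_iter):
        g = np.sign(x - c)                       # a subgradient of f at x
        x = project_unit_ball(x - g / np.sqrt(k + 1.0))
        f = np.sum(np.abs(x - c))
        if f < best_f:
            best_f, best_x = f, x.copy()
    return best_x, best_f

c = np.array([2.0, 0.0])
x_best, f_best = projected_subgradient_descent(c, x0=np.zeros(2))
assert abs(f_best - 1.0) < 1e-2                  # optimal value is 1
```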
Does (projected) subgradient descent converge?
How fast does it converge?
How to pick αk?
Subgradient descent can be viewed as a special case of projected subgradient descent
Their convergence properties are similar
We show the results of projected subgradient descent for generality
The projected subgradient descent iterations satisfy
∥x^(k+1) − x*∥² ≤ ∥x^(k) − x*∥² − 2α_k (f(x^(k)) − f(x*)) + α_k² ∥g^(k)∥².
Proof: ∥x^(k+1) − x*∥² = ∥P_C(x^(k) − α_k g^(k)) − x*∥² ≤ ∥x^(k) − α_k g^(k) − x*∥² (projection is nonexpansive and x* = P_C(x*)) = ∥x^(k) − x*∥² − 2α_k (x^(k) − x*)′g^(k) + α_k² ∥g^(k)∥²
The subgradient inequality gives f(x*) − f(x^(k)) ≥ (x* − x^(k))′g^(k), i.e., −2α_k (x^(k) − x*)′g^(k) ≤ −2α_k (f(x^(k)) − f(x*)), so the lemma holds.
Suppose that f: R^d → R is convex and Lipschitz continuous on C with constant L. Then
f(x_best^(k)) − f(x*) ≤ (∥x^(0) − x*∥² + L² Σ_{i=0}^k α_i²) / (2 Σ_{i=0}^k α_i).
For subgradient methods we can no longer guarantee the objective function values to be nonincreasing, so we select the best iterate x_best^(k), defined by
f(x_best^(k)) = min_{i=0,…,k} f(x^(i)).
With a fixed step size α_k ≡ α, f(x_best^(k)) − f(x*) ≤ ∥x^(0) − x*∥²/(2αk) + L²α/2.
With a diminishing step size α_k that satisfies Σ_k α_k² < ∞ and Σ_k α_k = ∞, we have f(x_best^(k)) → f(x*).
The "best" choice is α_k = O(1/√k), which gives f(x_best^(k)) − f(x*) ∼ (∥x^(0) − x*∥² + L² log(k))/√k.
Applying the fundamental lemma recursively, we have
∥x^(k+1) − x*∥² ≤ ∥x^(0) − x*∥² − 2 Σ_{i=0}^k α_i (f(x^(i)) − f(x*)) + Σ_{i=0}^k α_i² ∥g^(i)∥².
Rearranging the terms gives
2 Σ_{i=0}^k α_i (f(x^(i)) − f(x*)) ≤ ∥x^(0) − x*∥² − ∥x^(k+1) − x*∥² + Σ_{i=0}^k α_i² ∥g^(i)∥² ≤ ∥x^(0) − x*∥² + L² Σ_{i=0}^k α_i²,
where the last step uses ∥g^(i)∥ ≤ L, a consequence of Lipschitz continuity.
If in addition f is strongly convex with parameter m > 0, then with a step size α_k = 2/(m(k+1)), we have f(x_best^(k)) − f(x*) ≤ (2L²/m) · 1/(k+1).
Overall, subgradient methods have slower convergence rates than their gradient descent counterparts.
If f is Lipschitz continuous, then the optimization error decays at the rate of O(1/√k) (gradient descent: O(1/k)).
If f is Lipschitz continuous and strongly convex, then the optimization error decays at the rate of O(1/k) (gradient descent: O(ρ^k) for some ρ < 1).
Subgradient methods are simple and general enough to solve nonsmooth convex optimization problems
But their convergence is slow
In many statistical models, the nonsmooth function we want to optimize has a special structure: F(x)=f(x)+h(x), where f is convex and smooth, and h is convex but possibly nonsmooth
For the Lasso problem, F(β) = f(β) + h(β), where f(β) = (1/2)∥y − Xβ∥² and h(β) = λ∥β∥_1
Constrained problems min_{x∈C} f(x) can also be written as F(x) = f(x) + h(x), where h(x) = I_C(x) = 0 if x ∈ C, and ∞ if x ∉ C
The proximal gradient descent algorithm solves min_x F(x) using the following iteration scheme: given an initial value x^(0), iterate
x^(k+1) = prox_{αh}(x^(k) − α ∇f(x^(k))), k = 0, 1, …,
where prox_{αh}(·) is the proximal operator (defined next) of h with step size α.
The proximal operator of a convex function h with step size α is defined as
prox_{αh}(x) = argmin_u h(u) + (1/2α)∥u − x∥².
This is itself a strongly convex optimization problem, so the minimizer is unique
And it has a closed form for many functions h
If h(x) = ∥x∥, then prox_{αh}(x) = (1 − α/∥x∥)x if ∥x∥ ≥ α, and 0 if ∥x∥ < α.
If h(x) = (1/2)∥x∥², then prox_{αh}(x) = x/(1 + α).
If h(x) = (1/2)x′Ax + b′x + c, where A is positive definite, then
prox_{αh}(x) = (I + αA)^{−1}(x − αb).
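Since the quadratic prox objective is smooth, the closed form can be checked against its stationarity condition Au + b + (1/α)(u − x) = 0. A small numpy check (the values of A, b, α, and x are arbitrary illustrative choices):

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])                 # positive definite
b = np.array([1.0, -1.0])
alpha = 0.5
x = np.array([0.3, -0.7])

# closed form: prox_{alpha h}(x) = (I + alpha A)^{-1} (x - alpha b)
prox = np.linalg.solve(np.eye(2) + alpha * A, x - alpha * b)

# stationarity of u -> h(u) + (1/(2 alpha)) ||u - x||^2 at the prox point
grad = A @ prox + b + (prox - x) / alpha
assert np.allclose(grad, 0.0)
```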
If h(x) = ∥x∥_1, then prox_{αh}(x) = S_α(x), the soft-thresholding operator, defined elementwise by
(S_α(x))_i = x_i − α if x_i > α; 0 if |x_i| ≤ α; x_i + α if x_i < −α.
If h(x) = I_C(x), then prox_{αh}(x) = P_C(x), the projection operator.
This means that proximal gradient descent reduces to projected gradient descent when h(x) = I_C(x).
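Putting these pieces together for the Lasso problem above gives the classic iterative soft-thresholding scheme: a gradient step on the least-squares part followed by soft-thresholding. A minimal numpy sketch (the data, λ, and iteration count are illustrative; correctness is checked via the subgradient optimality condition ∥X′(Xβ − y)∥_∞ ≤ λ):

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding S_t, the prox of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_prox_grad(X, y, lam, n_iter=500):
    """Proximal gradient descent for 0.5*||y - X b||^2 + lam*||b||_1.

    Uses the fixed step size alpha = 1/L with L = lambda_max(X'X),
    the smoothness constant of the least-squares part.
    """
    L = np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)          # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true
lam = 0.1
beta = lasso_prox_grad(X, y, lam)
# optimality: 0 in the subdifferential, so |X'(X beta - y)| <= lam everywhere
assert np.all(np.abs(X.T @ (X @ beta - y)) <= lam + 1e-6)
```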
If h(x) = a·g(x) + b, a > 0, then prox_{αh}(x) = prox_{aαg}(x)
If h(x) = g(ax + b), a ≠ 0, then prox_{αh}(x) = (prox_{a²αg}(ax + b) − b)/a
If h(x) = g(x) + a′x + b, then prox_{αh}(x) = prox_{αg}(x − αa)
If h(x) = g(x) + (ρ/2)∥x − a∥², then prox_{αh}(x) = prox_{α̃g}((α̃/α)x + (ρα̃)a), where α̃ = α/(1 + αρ)
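These calculus rules are easy to validate numerically. For example, the linear-perturbation rule with g = ∥·∥_1 can be checked coordinate-by-coordinate by brute force, since the objective separates across coordinates (the values of α, a, and x below are arbitrary):

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

alpha = 0.7
a = np.array([0.5, -1.2])
x = np.array([1.0, 0.4])

# rule: prox_{alpha h}(x) = prox_{alpha g}(x - alpha a) for h = g + a'x + b
rule = soft_threshold(x - alpha * a, alpha)

# brute force: minimize |u| + a_i u + (u - x_i)^2 / (2 alpha) per coordinate
# (the constant b does not affect the argmin)
grid = np.linspace(-5.0, 5.0, 200001)
brute = np.array([
    grid[np.argmin(np.abs(grid) + ai * grid + (grid - xi) ** 2 / (2 * alpha))]
    for ai, xi in zip(a, x)
])
assert np.allclose(rule, brute, atol=1e-4)
```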
We first present a useful property of the proximal gradient descent algorithm. The proof can be found in [7].
Suppose f is convex and L-smooth. With the step size α = 1/L,
F(x^(k+1)) ≤ F(x^(k)), ∥x^(k+1) − x*∥ ≤ ∥x^(k) − x*∥.
The main convergence theorem: suppose f is convex and L-smooth. With the step size α = 1/L,
F(x^(k)) − F(x*) ≤ L∥x^(0) − x*∥² / (2k).
If in addition f is strongly convex with parameter m > 0, then with a fixed step size α = 1/L, we have
∥x^(k) − x*∥² ≤ (1 − m/L)^k ∥x^(0) − x*∥².
Proximal gradient descent matches the convergence rate of gradient descent.
Useful when prox_{αh} can be efficiently evaluated.
If f is L-smooth, then the optimization error, F(x^(k)) − F(x*), decays at the rate of O(1/k).
If f is L-smooth and m-strongly convex, then the optimization error decays exponentially fast at the rate of O(ρ^k), where ρ = 1 − m/L.
It turns out that the proximal gradient descent algorithm can be accelerated almost for free, using a somewhat magical technique.
The accelerated version: given an initial value x^(0) = x^(−1), iterate
y^(k+1) = x^(k) + ((k − 1)/(k + 2))(x^(k) − x^(k−1)),
x^(k+1) = prox_{αh}(y^(k+1) − α ∇f(y^(k+1))), k = 0, 1, …
The accelerated convergence rate: suppose f is convex and L-smooth. With the step size α = 1/L,
F(x^(k)) − F(x*) ≤ 2L∥x^(0) − x*∥² / (k + 1)².
If in addition f is strongly convex with parameter m > 0, then with κ = L/m, the iteration becomes
y^(k+1) = x^(k) + ((√κ − 1)/(√κ + 1))(x^(k) − x^(k−1)),
x^(k+1) = prox_{αh}(y^(k+1) − α ∇f(y^(k+1))), k = 0, 1, …
With a fixed step size α = 1/L, we have
F(x^(k)) − F(x*) ≤ (1 − 1/√κ)^k (F(x^(0)) − F(x*) + m∥x^(0) − x*∥²/2).
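A sketch of this accelerated iteration for the Lasso, assuming X has full column rank so that f(β) = (1/2)∥y − Xβ∥² is strongly convex with m = λ_min(X′X) (the data, λ, and iteration count are illustrative):

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding, the prox of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_accel_prox_grad(X, y, lam, n_iter=300):
    """Accelerated proximal gradient for 0.5*||y - X b||^2 + lam*||b||_1.

    Uses the strongly convex momentum (sqrt(kappa)-1)/(sqrt(kappa)+1)
    with kappa = L/m taken from the eigenvalues of X'X, and alpha = 1/L.
    """
    evals = np.linalg.eigvalsh(X.T @ X)
    m, L = evals.min(), evals.max()
    momentum = (np.sqrt(L / m) - 1.0) / (np.sqrt(L / m) + 1.0)
    beta = beta_prev = np.zeros(X.shape[1])
    for _ in range(n_iter):
        yk = beta + momentum * (beta - beta_prev)   # extrapolation step
        beta_prev = beta
        beta = soft_threshold(yk - X.T @ (X @ yk - y) / L, lam / L)
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 8))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0])
lam = 0.2
beta = lasso_accel_prox_grad(X, y, lam)
# subgradient optimality at the solution: |X'(X beta - y)| <= lam
assert np.all(np.abs(X.T @ (X @ beta - y)) <= lam + 1e-6)
```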
(Projected) subgradient descent is a general method to optimize a nonsmooth convex function
However, its convergence is typically slow
If the nonsmooth part has a simple proximal operator, then the proximal gradient descent algorithm is much preferred
Acceleration also comes almost "for free"
[1] Stephen Boyd and Lieven Vandenberghe (2004). Convex optimization. Cambridge University Press.
[2] Robert M. Gower (2018). Convergence theorems for gradient descent. Lecture notes for Statistical Optimization.
[3] Claude Vallee, Danielle Fortune, and Camelia Lerintiu (2008). Subdifferential of the Largest Eigenvalue of a Symmetrical Matrix: Application of Direct Projection Methods. Analysis and Applications.
[4] Adrian S. Lewis (1999). Nonsmooth analysis of eigenvalues. Mathematical Programming.
[5] https://yuxinchen2020.github.io/ele522_optimization/lectures/subgradient_methods.pdf
[6] Neal Parikh and Stephen Boyd (2014). Proximal algorithms. Foundations and trends® in Optimization.