We have spent much time on nonsmooth problems
They are very common in statistical models
Yet they are considered "hard problems" in convex optimization
So far, the main tool for overcoming nonsmoothness has been the proximal operator
Seen in proximal gradient descent, Douglas–Rachford splitting, Davis–Yin splitting, etc.
In this lecture we introduce a few other frameworks
Alternating direction method of multipliers
Coordinate descent
Case studies
Another popular framework to deal with nonsmooth or constrained problems is the alternating direction method of multipliers (ADMM). It considers problems of the form:
$$\min_{x,z}\ F(x,z):=f(x)+g(z)\quad\text{subject to}\quad Ax+Bz=c$$
where $x\in\mathbb{R}^n$, $z\in\mathbb{R}^m$, $A\in\mathbb{R}^{p\times n}$, $B\in\mathbb{R}^{p\times m}$, and $c\in\mathbb{R}^p$.
$f(x)$ and $g(z)$ are convex functions, possibly nonsmooth
$Ax+Bz=c$ represents general linear equality constraints
Although this problem form is a bit exotic compared with those we have seen in previous lectures
Many familiar problems can be converted into this form by cleverly defining the x and z variables
Lasso:
$$\min_x\ \frac12\|b-Ax\|^2+\lambda\|x\|_1$$
$f(x)=\frac12\|b-Ax\|^2$, $g(z)=\lambda\|z\|_1$
Linear constraint $x-z=0$
Least absolute deviations:
$$\min_x\ \|b-Ax\|_1$$
$f(x)=0$, $g(z)=\|z\|_1$
Linear constraint $Ax-z=b$
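As a quick check, substituting each constraint back into $f(x)+g(z)$ recovers the original objectives:
$$
z=x:\ \ \tfrac12\|b-Ax\|^2+\lambda\|z\|_1=\tfrac12\|b-Ax\|^2+\lambda\|x\|_1,
\qquad
z=Ax-b:\ \ 0+\|z\|_1=\|Ax-b\|_1=\|b-Ax\|_1.
$$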
ADMM can be expressed as the following algorithm. Given initial values $z^{(0)}$ and $u^{(0)}$, for $k=0,1,\ldots$, iterate
$$\begin{aligned}
x^{(k+1)} &= \operatorname*{arg\,min}_x\ f(x)+(\rho/2)\|Ax+Bz^{(k)}-c+u^{(k)}\|^2\\
z^{(k+1)} &= \operatorname*{arg\,min}_z\ g(z)+(\rho/2)\|Ax^{(k+1)}+Bz-c+u^{(k)}\|^2\\
u^{(k+1)} &= u^{(k)}+Ax^{(k+1)}+Bz^{(k+1)}-c
\end{aligned}$$
$\rho>0$ is an arbitrary "inverse step size" parameter. (A code sketch of this loop is given after the remarks below.)
The x- and z-updates look similar to proximal operators
However, here we have more general quadratic terms
This means that the x- and z-updates are typically harder to compute than plain proximal operators
On the other hand, ADMM does not need to explicitly project to the linear constraint set
These observations largely guide how to split the objective into the f and g functions
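To make the loop structure concrete, below is a minimal sketch (not from any package) of the scaled-form ADMM iteration in R; x_update() and z_update() are hypothetical user-supplied functions that solve the two argmin subproblems.

admm_sketch = function(x_update, z_update, A, B, cvec, rho = 1, maxit = 500)
{
    # Scaled-form ADMM: x, z, and the scaled dual variable u all start at zero
    x = numeric(ncol(A))
    z = numeric(ncol(B))
    u = numeric(length(cvec))
    for(k in 1:maxit)
    {
        # x-update: argmin_x f(x) + (rho/2) * ||A x + B z - c + u||^2
        x = x_update(z, u, rho)
        # z-update: argmin_z g(z) + (rho/2) * ||A x + B z - c + u||^2
        z = z_update(x, u, rho)
        # Dual update using the constraint residual
        u = u + as.numeric(A %*% x + B %*% z - cvec)
    }
    list(x = x, z = z, u = u)
}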
Consider using ADMM to solve the problem
$$\min_x\ f(x)+g(x)\quad\Longleftrightarrow\quad\min_{x,z}\ f(x)+g(z)\ \text{ subject to }\ x-z=0$$
Then the iteration becomes
$$\begin{aligned}
x^{(k+1)} &= \operatorname{prox}_{(1/\rho)f}\big(z^{(k)}-u^{(k)}\big)\\
z^{(k+1)} &= \operatorname{prox}_{(1/\rho)g}\big(x^{(k+1)}+u^{(k)}\big)\\
u^{(k+1)} &= u^{(k)}+x^{(k+1)}-z^{(k+1)}
\end{aligned}$$
This is equivalent to Douglas–Rachford splitting (DRS), up to a slight reparameterization of the variables.
Many implementation details can be found in the main reference [1]
Stopping criterion
Selection of the ρ parameter
Closed-form expressions for specific problems
The sequence $(x^{(k)},z^{(k)},u^{(k)})$ satisfies
$$\|A(x^{(k+1)}-x^{(k)})\|^2+\|B(z^{(k+1)}-z^{(k)})\|^2+\|u^{(k+1)}-u^{(k)}\|^2=o(1/k).$$
This is one of the most general convergence results on ADMM; see [2].
Many other faster rates exist by imposing stronger conditions on f(x) and g(z), e.g. smoothness, strong convexity, etc.
One more idea to overcome nonsmoothness is to directly find the optimal point of nonsmooth problems
Of course, this is unrealistic for general problems
But in many cases this is possible for univariate functions
Lasso model
$$\min_\beta\ \ell(\beta):=\min_\beta\ \frac{1}{2n}\|y-X\beta\|^2+\lambda\|\beta\|_1$$
Solving for the whole $\beta$ vector at once is hard
However, suppose we fix all elements of $\beta$ except $\beta_i$
Then we can actually derive the optimal $\beta_i$ in closed form
Recall the optimality condition $f(x^*)=\min_x f(x)\ \Longleftrightarrow\ 0\in\partial f(x^*)$
Viewing $\ell$ as a function $\tilde\ell$ of $\beta_i$ alone,
$$\partial\tilde\ell(\beta_i)=n^{-1}\|X_i\|^2\beta_i+n^{-1}X_i'(X_{-i}\beta_{-i}-y)+\lambda\,\partial|\beta_i|,$$
$X_i$ is the $i$-th column of $X$
$X_{-i}$ is the matrix formed by removing $X_i$ from $X$
$\beta_{-i}$ is the vector formed by removing $\beta_i$ from $\beta$
Setting $0\in\partial\tilde\ell(\beta_i)$ and solving gives
$$\beta_i=S_{n\lambda/\|X_i\|^2}\!\left(\frac{X_i'(y-X_{-i}\beta_{-i})}{\|X_i\|^2}\right),$$
where $S_t(\cdot)$ denotes the soft-thresholding operator.
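As a small illustration, the closed-form update above can be computed directly. The function below is a sketch assuming X, y, lam, and a current beta vector are available; update_coordinate() is an illustrative name, and soft_thresh() is the soft-thresholding operator also used later in the case study.

# Soft-thresholding operator S_a(x)
soft_thresh = function(x, a) sign(x) * pmax(0, abs(x) - a)

# Closed-form update for a single coordinate beta_i, all other coordinates fixed
update_coordinate = function(i, beta, X, y, lam)
{
    n = nrow(X)
    xi = X[, i]
    # Partial residual y - X_{-i} beta_{-i}
    r_i = y - as.numeric(X[, -i, drop = FALSE] %*% beta[-i])
    # beta_i = S_{n*lam / ||X_i||^2}( X_i'(y - X_{-i} beta_{-i}) / ||X_i||^2 )
    soft_thresh(sum(xi * r_i) / sum(xi^2), n * lam / sum(xi^2))
}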
These findings motivate the coordinate descent (CD) algorithm. Consider the problem
$$\min_x\ f(x):=\min_x\ f(x_1,\ldots,x_n)$$
One common type of CD algorithm proceeds as follows. Given initial value $x^{(0)}$, for $k=0,1,\ldots$, iterate:
$$x_i^{(k+1)}=\operatorname*{arg\,min}_{x_i}\ f\big(x_1^{(k+1)},\ldots,x_{i-1}^{(k+1)},x_i,x_{i+1}^{(k)},\ldots,x_n^{(k)}\big)$$
for $i=1,\ldots,n$ within each outer iteration $k$.
This is typically called the cyclic CD algorithm.
Another popular variant is to randomly select a coordinate to update. Given initial value x(0), for k=0,1,…, do:
Randomly select $i_k$ from $\{1,2,\ldots,n\}$ with equal probabilities
Fix the other coordinates and update
$$x_{i_k}^{(k+1)}=\operatorname*{arg\,min}_{x_{i_k}}\ f\big(x_1^{(k)},\ldots,x_{i_k-1}^{(k)},x_{i_k},x_{i_k+1}^{(k)},\ldots,x_n^{(k)}\big),\qquad x_j^{(k+1)}=x_j^{(k)},\ \ j\neq i_k$$
In the literature, this method is sometimes called randomized CD.
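Below is a minimal sketch of randomized CD on a toy smooth quadratic $f(x)=\frac12 x'Ax-b'x$, where each coordinate minimization has a closed form; the data here are simulated purely for illustration.

set.seed(42)
nc = 10
Amat = crossprod(matrix(rnorm(nc * nc), nc)) + diag(nc)  # a positive definite matrix
bvec = rnorm(nc)
# f(x) = 0.5 * x' Amat x - bvec' x is minimized at the solution of Amat x = bvec

xcur = numeric(nc)
for(k in 1:2000)
{
    i = sample.int(nc, 1)  # pick one coordinate uniformly at random
    # Exact minimization over coordinate i with the others fixed:
    # Amat[i, i] * x_i + sum(Amat[i, -i] * x_{-i}) - bvec[i] = 0
    xcur[i] = (bvec[i] - sum(Amat[i, -i] * xcur[-i])) / Amat[i, i]
}
max(abs(Amat %*% xcur - bvec))  # close to zero after enough iterations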
Many other variants exist:
Permutation CD: for each k, use a random permutation of {1,2,…,n} to define the order of coordinate updates
Block CD: each time update a block of coordinates simultaneously, instead of a single coordinate
Coordinate proximal gradient descent: use proximal operators to update coordinates
...
Unfortunately, CD does not always converge to an optimal point for general nonsmooth $f$. However, consider problems of the form
$$f(x)=g(x)+\sum_{i=1}^n h_i(x_i),$$
where g(x) is convex and differentiable, and each hi(xi) is convex, possibly nonsmooth.
[3] shows that with very mild regularity conditions, the CD iterates x(k) converge to an optimal point of f(x).
The result holds for block CD and arbitrary order of cycle through coordinates.
There exist many convergence results for different variants of CD based on different assumptions on f(x). Below is one representative result with weak assumptions.
Assume that $g(x)$ is coordinate-wise $L_i$-smooth, and define
$$R_0=\max_y\max_{x^*\in X^*}\left\{\left(\sum_{i=1}^n L_i(y_i-x_i^*)^2\right)^{1/2}:f(y)\le f(x^{(0)})\right\},$$
where $X^*$ denotes the optimal set.
Consider the randomized CD algorithm. Fix a confidence parameter $0<\rho<1$ and an accuracy level $\varepsilon<\min\{R_0^2,\,f(x^{(0)})-f(x^*)\}$. Then with
$$k\ge\frac{2nR_0^2}{\varepsilon}\log\frac{f(x^{(0)})-f(x^*)}{\varepsilon\rho},$$
we have $f(x^{(k)})-f(x^*)\le\varepsilon$ with probability at least $1-\rho$.
For simplicity suppose $L_i\equiv L$, and assume that $g(x)$ is also strongly convex with parameter $m>0$. Fix a confidence parameter $0<\rho<1$ and an accuracy level $\varepsilon>0$. If
$$k\ge\frac{4Ln}{m}\log\frac{f(x^{(0)})-f(x^*)}{\varepsilon\rho},$$
then $f(x^{(k)})-f(x^*)\le\varepsilon$ with probability at least $1-\rho$.
We use the lasso model
$$\min_\beta\ \ell(\beta):=\min_\beta\ \frac{1}{2n}\|y-X\beta\|^2+\lambda\|\beta\|_1$$
to study the convergence properties and computational efficiency of several algorithms we have covered in previous lectures.
Derive the update rules for several algorithms
Study the change of loss function versus iteration number
Study the change of loss function versus running time
The design matrix is a simulated sparse matrix, but for now we use its dense version.
library(Matrix)
set.seed(123)
n = 500
p = 1000
s = 30
xsp = rsparsematrix(n, p, density = 0.1)
x = as.matrix(xsp)
beta = c(rnorm(s), numeric(p - s))
y = as.numeric(xsp %*% beta) + 0.1 * rnorm(n)
lam = 0.001

objfn = function(beta, x, y, lam)
{
    0.5 * mean((y - as.numeric(x %*% beta))^2) + lam * sum(abs(beta))
}
Use the celebrated glmnet package to compute the optimal value.
library(glmnet)
res = glmnet(x, y, lambda = lam, thresh = 1e-15, standardize = FALSE, intercept = FALSE)
beta_opt = as.numeric(res$beta)
head(beta_opt)
## [1] -0.3862279 1.4304118 -0.2206914 1.1256892 0.1704705 0.8657187
tail(beta_opt)
## [1] 0.000000000 -0.003938278 0.000000000 0.000000000 0.000000000
## [6] 0.008219107
obj_opt = objfn(beta_opt, x, y, lam)
obj_opt
## [1] 0.02782515
A general interface for running optimization algorithms.
run_opt = function(x, y, lam, init_state, update_rule, maxit = 1001, ...)
{
    ...
}
init_state() and update_rule() are two function arguments that characterize the optimization algorithm
init_state(x, y, lam, ...) is used to create the initial state of the algorithm
update_rule(beta, iter, x, y, lam, state) returns the updated $\beta$ and the state, as list(beta, state)
The full implementation of run_opt():
run_opt = function(x, y, lam, init_state, update_rule, maxit = 1001, ...)
{
    p = ncol(x)
    beta_hat = numeric(p)
    losses = c()
    # Initialization
    state = init_state(x, y, lam, ...)
    # Main loop
    for(i in 1:maxit)
    {
        res = update_rule(beta_hat, i - 1, x, y, lam, state)
        beta_hat = res$beta
        state = res$state
        loss = objfn(beta_hat, x, y, lam)
        losses = c(losses, loss)
        if((i - 1) %% 100 == 0)
            cat(sprintf("Iter %d, loss = %.6f\n", i - 1, loss))
    }
    list(beta_hat = beta_hat, losses = losses)
}
Choose the subgradient $\tilde\nabla\ell(\beta)=(1/n)X'(X\beta-y)+\lambda\cdot\operatorname{sign}(\beta)$
Update rule: $\beta^{(k+1)}=\beta^{(k)}-\alpha_k\tilde\nabla\ell(\beta^{(k)})$
Step size $\alpha_k=\alpha_0/\sqrt{k+1}$
init_subgrad = function(x, y, lam, alpha0 = 1)
{
    list(alpha0 = alpha0)
}

update_subgrad = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha0 = state$alpha0
    alphak = alpha0 / sqrt(k + 1)
    Xtr = crossprod(x, as.numeric(x %*% beta) - y)
    subgrad = as.numeric(Xtr) / n + lam * sign(beta)
    new_beta = beta - alphak * subgrad
    list(beta = new_beta, state = state)
}
res_subgrad1 = run_opt(x, y, lam, init_subgrad, update_subgrad, alpha0 = 1)

## Iter 0, loss = 0.770908
## Iter 100, loss = 0.084973
## Iter 200, loss = 0.074552
## Iter 300, loss = 0.070417
## Iter 400, loss = 0.067906
## Iter 500, loss = 0.066074
## Iter 600, loss = 0.064626
## Iter 700, loss = 0.063428
## Iter 800, loss = 0.062394
## Iter 900, loss = 0.061477
## Iter 1000, loss = 0.060661
head(res_subgrad1$beta)
## [1] -2.174641e-01 9.316010e-01 -1.026982e-01 6.878443e-01 1.129936e-05
## [6] 2.784390e-01

tail(res_subgrad1$beta)

## [1] -3.352589e-02 -5.879592e-02 -1.432275e-06 -3.051985e-02 5.442229e-02
## [6] 1.141592e-01
res_subgrad10 = run_opt(x, y, lam, init_subgrad, update_subgrad, alpha0 = 10)

## Iter 0, loss = 8.881744
## Iter 100, loss = 0.049232
## Iter 200, loss = 0.040807
## Iter 300, loss = 0.036166
## Iter 400, loss = 0.032993
## Iter 500, loss = 0.030828
## Iter 600, loss = 0.029429
## Iter 700, loss = 0.028636
## Iter 800, loss = 0.028227
## Iter 900, loss = 0.028043
## Iter 1000, loss = 0.027961
head(res_subgrad10$beta)
## [1] -0.3913635 1.4284850 -0.2179803 1.1101840 0.1683580 0.8471952
tail(res_subgrad10$beta)
## [1] -2.979032e-04 -3.842055e-03 2.795830e-04 -9.758347e-05 -1.092878e-04
## [6] 1.166201e-02
Update rule: $\beta^{(k+1)}=S_{\alpha\lambda}\big(\beta^{(k)}-(\alpha/n)X'(X\beta^{(k)}-y)\big)$
Step size $\alpha=1/\lambda_{\max}(n^{-1}X'X)=n/\lambda_{\max}(X'X)$
library(RSpectra)

# Soft-thresholding function
soft_thresh = function(x, a)
{
    sign(x) * pmax(0, abs(x) - a)
}

init_ista = function(x, y, lam)
{
    # Compute the largest eigenvalue of X'X
    alpha = nrow(x) / svds(x, k = 1, nu = 0, nv = 0)$d^2
    list(alpha = alpha)
}

update_ista = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha = state$alpha
    Xtr = crossprod(x, as.numeric(x %*% beta) - y)
    grad = as.numeric(Xtr) / n
    new_beta = soft_thresh(beta - alpha * grad, alpha * lam)
    list(beta = new_beta, state = state)
}
res_ista = run_opt(x, y, lam, init_ista, update_ista)
## Iter 0, loss = 0.494847
## Iter 100, loss = 0.044493
## Iter 200, loss = 0.032321
## Iter 300, loss = 0.028040
## Iter 400, loss = 0.027830
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_ista$beta)
## [1] -0.3862280 1.4304114 -0.2206905 1.1256887 0.1704708 0.8657158
tail(res_ista$beta)
## [1] 0.000000000 -0.003938134 0.000000000 0.000000000 0.000000000
## [6] 0.008219716
$$\begin{aligned}
u^{(k+1)} &= \beta^{(k)}+\frac{k-1}{k+2}\big(\beta^{(k)}-\beta^{(k-1)}\big)\\
\beta^{(k+1)} &= S_{\alpha\lambda}\big(u^{(k+1)}-(\alpha/n)X'(Xu^{(k+1)}-y)\big)
\end{aligned}$$
init_fista = function(x, y, lam)
{
    p = ncol(x)
    alpha = nrow(x) / svds(x, k = 1, nu = 0, nv = 0)$d^2
    list(alpha = alpha, old_beta = numeric(p), betau = numeric(p))
}

update_fista = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha = state$alpha
    state$betau = if(k <= 1) beta else beta + (k - 1) / (k + 2) * (beta - state$old_beta)
    Xtr = crossprod(x, as.numeric(x %*% state$betau) - y)
    grad = as.numeric(Xtr) / n
    new_beta = soft_thresh(state$betau - alpha * grad, alpha * lam)
    state$old_beta = beta
    list(beta = new_beta, state = state)
}
res_fista = run_opt(x, y, lam, init_fista, update_fista)
## Iter 0, loss = 0.494847
## Iter 100, loss = 0.027830
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_fista$beta)
## [1] -0.3862278 1.4304116 -0.2206913 1.1256895 0.1704706 0.8657180
tail(res_fista$beta)
## [1] 0.000000000 -0.003938369 0.000000000 0.000000000 0.000000000
## [6] 0.008219289
Let $g(\beta)=\lambda\|\beta\|_1$ and $h(\beta)=\frac{1}{2n}\|y-X\beta\|^2$. Then
$$\operatorname{prox}_{\alpha g}(z)=S_{\alpha\lambda}(z),\qquad
\operatorname{prox}_{\alpha h}(\beta)=\big(I+(\alpha/n)X'X\big)^{-1}\big(\beta+(\alpha/n)X'y\big)$$
Update rule:
$$\begin{aligned}
\beta^{(k+1)} &= S_{\alpha\lambda}\big(u^{(k)}\big)\\
u^{(k+1)} &= u^{(k)}+A^{-1}\big(2\beta^{(k+1)}-u^{(k)}+(\alpha/n)X'y\big)-\beta^{(k+1)},
\end{aligned}\qquad\text{where } A=I+(\alpha/n)X'X$$
chol_solve = function(R, b)
{
    y = backsolve(R, b, transpose = TRUE)  # R'*y = b
    x = backsolve(R, y)                    # R*x = y
    x
}

init_drs = function(x, y, lam, alpha = 1)
{
    n = nrow(x)
    p = ncol(x)
    scaled_xty = alpha / n * as.numeric(crossprod(x, y))
    R = chol(diag(p) + alpha / n * crossprod(x))
    list(alpha = alpha, sxty = scaled_xty, R = as.matrix(R), betau = numeric(p))
}

update_drs = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha = state$alpha
    new_beta = soft_thresh(state$betau, alpha * lam)
    rhs = 2 * new_beta - state$betau + state$sxty
    state$betau = state$betau + chol_solve(state$R, rhs) - new_beta
    list(beta = new_beta, state = state)
}
res_drs1 = run_opt(x, y, lam, init_drs, update_drs, alpha = 1)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.053128
## Iter 200, loss = 0.041675
## Iter 300, loss = 0.034461
## Iter 400, loss = 0.029864
## Iter 500, loss = 0.028097
## Iter 600, loss = 0.027848
## Iter 700, loss = 0.027828
## Iter 800, loss = 0.027826
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_drs1$beta)
## [1] -0.3863738 1.4304559 -0.2205224 1.1254740 0.1704870 0.8652928
tail(res_drs1$beta)
## [1] 0.000000000 -0.003902763 0.000000000 0.000000000 0.000000000
## [6] 0.008261868
res_drs10 = run_opt(x, y, lam, init_drs, update_drs, alpha = 10)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.027825
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_drs10$beta)
## [1] -0.3862278 1.4304117 -0.2206913 1.1256895 0.1704706 0.8657182
tail(res_drs10$beta)
## [1] 0.000000000 -0.003938377 0.000000000 0.000000000 0.000000000
## [6] 0.008219268
Let $f(\beta)=\lambda\|\beta\|_1$, $g(\beta)=\frac{1}{2n}\|y-X\beta\|^2$. Then
Update rule:
$$\begin{aligned}
\beta^{(k+1)} &= S_{\lambda/\rho}\big(z^{(k)}-u^{(k)}\big)\\
z^{(k+1)} &= A^{-1}\big(\beta^{(k+1)}+u^{(k)}+(n\rho)^{-1}X'y\big)\\
u^{(k+1)} &= u^{(k)}+\beta^{(k+1)}-z^{(k+1)},
\end{aligned}\qquad\text{where } A=I+(n\rho)^{-1}X'X$$
init_admm = function(x, y, lam, rho = 1)
{
    n = nrow(x)
    p = ncol(x)
    scaled_xty = as.numeric(crossprod(x, y)) / n / rho
    R = chol(diag(p) + crossprod(x) / n / rho)
    list(rho = rho, sxty = scaled_xty, R = as.matrix(R),
         betaz = numeric(p), betau = numeric(p))
}

update_admm = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    rho = state$rho
    new_beta = soft_thresh(state$betaz - state$betau, lam / rho)
    rhs = new_beta + state$betau + state$sxty
    state$betaz = chol_solve(state$R, rhs)
    state$betau = state$betau + new_beta - state$betaz
    list(beta = new_beta, state = state)
}
res_admm1 = run_opt(x, y, lam, init_admm, update_admm, rho = 1)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.053098
## Iter 200, loss = 0.041664
## Iter 300, loss = 0.034453
## Iter 400, loss = 0.029859
## Iter 500, loss = 0.028096
## Iter 600, loss = 0.027848
## Iter 700, loss = 0.027828
## Iter 800, loss = 0.027826
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_admm1$beta)
## [1] -0.3863737 1.4304559 -0.2205226 1.1254743 0.1704871 0.8652934
tail(res_admm1$beta)
## [1] 0.000000000 -0.003902788 0.000000000 0.000000000 0.000000000
## [6] 0.008261821
res_admm0.1 = run_opt(x, y, lam, init_admm, update_admm, rho = 0.1)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.027825
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_admm0.1$beta)
## [1] -0.3862278 1.4304117 -0.2206913 1.1256895 0.1704706 0.8657182
tail(res_admm0.1$beta)
## [1] 0.000000000 -0.003938377 0.000000000 0.000000000 0.000000000
## [6] 0.008219268
Each iteration contains a full cycle of $\beta_i$ updates
Update rule:
$$\beta_i\leftarrow S_{n\lambda/\|X_i\|^2}\!\left(\frac{X_i'(y-X_{-i}\beta_{-i})}{\|X_i\|^2}\right)$$
Equivalently, in terms of the full residual $r=y-X\beta$:
$$\beta_i\leftarrow S_{n\lambda/\|X_i\|^2}\!\left(\beta_i+\frac{X_i'r}{\|X_i\|^2}\right)$$
# Soft-thresholding function
soft_thresh = function(x, a)
{
    sign(x) * pmax(0, abs(x) - a)
}

init_cd = function(x, y, lam)
{
    xi_square = colSums(x^2)
    thresh = nrow(x) * lam / xi_square
    list(xi_square = xi_square, thresh = thresh)
}

update_cd = function(beta, k, x, y, lam, state)
{
    new_beta = beta
    # betai <- S(betai + xi' * r / ||xi||^2, t), r = y - x * beta
    r = y - c(x %*% beta)
    for(i in seq_along(new_beta))
    {
        old = new_beta[i]
        newb = old + sum(x[, i] * r) / state$xi_square[i]
        thresh = state$thresh[i]
        new_beta[i] = soft_thresh(newb, thresh)
        r = r - (new_beta[i] - old) * x[, i]
    }
    list(beta = new_beta, state = state)
}
res_cd = run_opt(x, y, lam, init_cd, update_cd)
## Iter 0, loss = 0.046708
## Iter 100, loss = 0.027825
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_cd$beta)
## [1] -0.3862278 1.4304117 -0.2206913 1.1256895 0.1704706 0.8657182
tail(res_cd$beta)
## [1] 0.000000000 -0.003938377 0.000000000 0.000000000 0.000000000
## [6] 0.008219268
K = 1000
gdat = data.frame(
    iter = rep(1:K, times = 8),
    loss = c(res_subgrad1$losses[1:K], res_subgrad10$losses[1:K],
             res_ista$losses[1:K], res_fista$losses[1:K],
             res_drs1$losses[1:K], res_drs10$losses[1:K],
             res_admm1$losses[1:K], res_cd$losses[1:K]),
    method = rep(c("Subgrad α=1", "Subgrad α=10", "ISTA", "FISTA",
                   "DRS α=1", "DRS α=10", "ADMM ρ=1", "CD"), each = K)
)
library(ggplot2)
ggplot(gdat, aes(x = iter, y = log10(loss - obj_opt))) +
    geom_line(aes(color = method)) +
    scale_color_hue("Method") +
    xlab("Iteration") + ylab("Log-error") +
    theme_bw()
The previous comparison is neither rigorous nor complete, and more work needs to be done:
The plot only shows error vs. iteration, but not error vs. computing time (a rough timing sketch is given after this list)
The direct factorization of $A=I+(\alpha/n)X'X$ is not optimal (recall that $p>n$)
Design of algorithms when $X$ is sparse (i.e., operating on the xsp object)
Left as after-class exercises.
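As a starting point for the first item, one rough (far from rigorous) way to compare running time is to wrap run_opt() with system.time(), reusing the functions defined above:

# system.time() wraps a full call to run_opt(); dividing by maxit = 1001 gives a
# rough per-iteration cost (the printing overhead inside run_opt() is included)
t_ista = as.numeric(system.time(run_opt(x, y, lam, init_ista, update_ista))["elapsed"])
t_cd   = as.numeric(system.time(run_opt(x, y, lam, init_cd, update_cd))["elapsed"])
c(ISTA_per_iter = t_ista / 1001, CD_per_iter = t_cd / 1001)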
[1] Stephen Boyd et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
[2] Wei Deng et al. (2017). Parallel multi-block ADMM with o(1/k) convergence. Journal of Scientific Computing.
[3] Paul Tseng (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications.
[4] Peter Richtárik and Martin Takáč (2014). Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming.