We have spent much time on nonsmooth problems
They are very common in statistical models
Yet they are considered "hard problems" in convex optimization
So far, the main tool for overcoming nonsmoothness has been the proximal operator
Seen in proximal gradient descent, Douglas–Rachford splitting, Davis–Yin splitting, etc.
In this lecture we introduce a few other frameworks
Alternating direction method of multipliers
Coordinate descent
Case studies
Another popular framework to deal with nonsmooth or constrained problems is the alternating direction method of multipliers (ADMM). It considers problems of the form:
$$\min_{x,z}\ F(x,z):=f(x)+g(z)\quad\text{subject to}\quad Ax+Bz=c$$
where $x\in\mathbb{R}^n$, $z\in\mathbb{R}^m$, $A\in\mathbb{R}^{p\times n}$, $B\in\mathbb{R}^{p\times m}$, and $c\in\mathbb{R}^p$.
$f(x)$ and $g(z)$ are convex functions, possibly nonsmooth
$Ax+Bz=c$ represents general linear equality constraints
Although this problem form is a bit exotic compared with those we have seen in previous lectures
Many familiar problems can be converted into this form by cleverly defining the x and z variables
Lasso:
$$\min_x\ \frac12\|b-Ax\|^2+\lambda\|x\|_1$$
$f(x)=\frac12\|b-Ax\|^2$, $g(z)=\lambda\|z\|_1$
Linear constraint $x-z=0$
Least absolute deviations:
$$\min_x\ \|b-Ax\|_1$$
$f(x)=0$, $g(z)=\|z\|_1$
Linear constraint $Ax-z=b$
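As a quick check, substituting each constraint back into $f(x)+g(z)$ recovers the original objectives:
$$
z=x:\ \ \tfrac12\|b-Ax\|^2+\lambda\|z\|_1=\tfrac12\|b-Ax\|^2+\lambda\|x\|_1,
\qquad
z=Ax-b:\ \ 0+\|z\|_1=\|Ax-b\|_1=\|b-Ax\|_1.
$$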
ADMM can be expressed as the following algorithm. Given initial values $z^{(0)}$ and $u^{(0)}$, for $k=0,1,\ldots$, iterate
$$\begin{aligned}
x^{(k+1)} &= \operatorname*{arg\,min}_x\ f(x)+(\rho/2)\|Ax+Bz^{(k)}-c+u^{(k)}\|^2\\
z^{(k+1)} &= \operatorname*{arg\,min}_z\ g(z)+(\rho/2)\|Ax^{(k+1)}+Bz-c+u^{(k)}\|^2\\
u^{(k+1)} &= u^{(k)}+Ax^{(k+1)}+Bz^{(k+1)}-c
\end{aligned}$$
$\rho>0$ is an arbitrary "inverse step size" parameter. (A code sketch of this loop is given after the remarks below.)
The x- and z-updates look similar to proximal operators
However, here we have more general quadratic terms
This means that the x- and z-updates are typically harder to compute than plain proximal operators
On the other hand, ADMM does not need to explicitly project to the linear constraint set
These observations largely guide how to split the objective into the f and g functions
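To make the loop structure concrete, below is a minimal sketch (not from any package) of the scaled-form ADMM iteration in R; x_update() and z_update() are hypothetical user-supplied functions that solve the two argmin subproblems.

admm_sketch = function(x_update, z_update, A, B, cvec, rho = 1, maxit = 500)
{
    # Scaled-form ADMM: x, z, and the scaled dual variable u all start at zero
    x = numeric(ncol(A))
    z = numeric(ncol(B))
    u = numeric(length(cvec))
    for(k in 1:maxit)
    {
        # x-update: argmin_x f(x) + (rho/2) * ||A x + B z - c + u||^2
        x = x_update(z, u, rho)
        # z-update: argmin_z g(z) + (rho/2) * ||A x + B z - c + u||^2
        z = z_update(x, u, rho)
        # Dual update using the constraint residual
        u = u + as.numeric(A %*% x + B %*% z - cvec)
    }
    list(x = x, z = z, u = u)
}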
Consider using ADMM to solve the problem
$$\min_x\ f(x)+g(x)\quad\Longleftrightarrow\quad\min_{x,z}\ f(x)+g(z)\ \text{ subject to }\ x-z=0$$
Then the iteration becomes
$$\begin{aligned}
x^{(k+1)} &= \operatorname{prox}_{(1/\rho)f}\big(z^{(k)}-u^{(k)}\big)\\
z^{(k+1)} &= \operatorname{prox}_{(1/\rho)g}\big(x^{(k+1)}+u^{(k)}\big)\\
u^{(k+1)} &= u^{(k)}+x^{(k+1)}-z^{(k+1)}
\end{aligned}$$
This is equivalent to Douglas–Rachford splitting (DRS), up to a slight reparameterization of the variables.
Many implementation details can be found in the main reference [1]
Stopping criterion
Selection of the ρ parameter
Closed-form expressions for specific problems
The sequence $(x^{(k)},z^{(k)},u^{(k)})$ satisfies
$$\|A(x^{(k+1)}-x^{(k)})\|^2+\|B(z^{(k+1)}-z^{(k)})\|^2+\|u^{(k+1)}-u^{(k)}\|^2=o(1/k).$$
This is one of the most general convergence results on ADMM; see [2].
Many other faster rates exist by imposing stronger conditions on f(x) and g(z), e.g. smoothness, strong convexity, etc.
One more idea to overcome nonsmoothness is to directly find the optimal point of nonsmooth problems
Of course, this is unrealistic for general problems
But in many cases this is possible for univariate functions
Lasso model
$$\min_\beta\ \ell(\beta):=\min_\beta\ \frac{1}{2n}\|y-X\beta\|^2+\lambda\|\beta\|_1$$
Solving for the whole $\beta$ vector at once is hard
However, suppose we fix all elements of $\beta$ except $\beta_i$
Then we can actually derive the optimal $\beta_i$ in closed form
Recall the optimality condition $f(x^*)=\min_x f(x)\ \Longleftrightarrow\ 0\in\partial f(x^*)$
Viewing $\ell$ as a function $\tilde\ell$ of $\beta_i$ alone,
$$\partial\tilde\ell(\beta_i)=n^{-1}\|X_i\|^2\beta_i+n^{-1}X_i'(X_{-i}\beta_{-i}-y)+\lambda\,\partial|\beta_i|,$$
$X_i$ is the $i$-th column of $X$
$X_{-i}$ is the matrix formed by removing $X_i$ from $X$
$\beta_{-i}$ is the vector formed by removing $\beta_i$ from $\beta$
Setting $0\in\partial\tilde\ell(\beta_i)$ and solving gives
$$\beta_i=S_{n\lambda/\|X_i\|^2}\!\left(\frac{X_i'(y-X_{-i}\beta_{-i})}{\|X_i\|^2}\right),$$
where $S_t(\cdot)$ denotes the soft-thresholding operator.
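As a small illustration, the closed-form update above can be computed directly. The function below is a sketch assuming X, y, lam, and a current beta vector are available; update_coordinate() is an illustrative name, and soft_thresh() is the soft-thresholding operator also used later in the case study.

# Soft-thresholding operator S_a(x)
soft_thresh = function(x, a) sign(x) * pmax(0, abs(x) - a)

# Closed-form update for a single coordinate beta_i, all other coordinates fixed
update_coordinate = function(i, beta, X, y, lam)
{
    n = nrow(X)
    xi = X[, i]
    # Partial residual y - X_{-i} beta_{-i}
    r_i = y - as.numeric(X[, -i, drop = FALSE] %*% beta[-i])
    # beta_i = S_{n*lam / ||X_i||^2}( X_i'(y - X_{-i} beta_{-i}) / ||X_i||^2 )
    soft_thresh(sum(xi * r_i) / sum(xi^2), n * lam / sum(xi^2))
}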
These findings motivate the coordinate descent (CD) algorithm. Consider the problem
$$\min_x\ f(x):=\min_x\ f(x_1,\ldots,x_n)$$
One common type of CD algorithm proceeds as follows. Given initial value $x^{(0)}$, for $k=0,1,\ldots$, iterate:
$$x_i^{(k+1)}=\operatorname*{arg\,min}_{x_i}\ f\big(x_1^{(k+1)},\ldots,x_{i-1}^{(k+1)},x_i,x_{i+1}^{(k)},\ldots,x_n^{(k)}\big)$$
for $i=1,\ldots,n$ within each outer iteration $k$.
This is typically called the cyclic CD algorithm.
Another popular variant is to randomly select a coordinate to update. Given initial value x(0), for k=0,1,…, do:
Randomly select $i_k$ from $\{1,2,\ldots,n\}$ with equal probabilities
Fix the other coordinates and update
$$x_{i_k}^{(k+1)}=\operatorname*{arg\,min}_{x_{i_k}}\ f\big(x_1^{(k)},\ldots,x_{i_k-1}^{(k)},x_{i_k},x_{i_k+1}^{(k)},\ldots,x_n^{(k)}\big),\qquad x_j^{(k+1)}=x_j^{(k)},\ \ j\neq i_k$$
In the literature, this method is sometimes called randomized CD.
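Below is a minimal sketch of randomized CD on a toy smooth quadratic $f(x)=\frac12 x'Ax-b'x$, where each coordinate minimization has a closed form; the data here are simulated purely for illustration.

set.seed(42)
nc = 10
Amat = crossprod(matrix(rnorm(nc * nc), nc)) + diag(nc)  # a positive definite matrix
bvec = rnorm(nc)
# f(x) = 0.5 * x' Amat x - bvec' x is minimized at the solution of Amat x = bvec

xcur = numeric(nc)
for(k in 1:2000)
{
    i = sample.int(nc, 1)  # pick one coordinate uniformly at random
    # Exact minimization over coordinate i with the others fixed:
    # Amat[i, i] * x_i + sum(Amat[i, -i] * x_{-i}) - bvec[i] = 0
    xcur[i] = (bvec[i] - sum(Amat[i, -i] * xcur[-i])) / Amat[i, i]
}
max(abs(Amat %*% xcur - bvec))  # close to zero after enough iterations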
Many other variants exist:
Permutation CD: for each k, use a random permutation of {1,2,…,n} to define the order of coordinate updates
Block CD: each time update a block of coordinates simultaneously, instead of a single coordinate
Coordinate proximal gradient descent: use proximal operators to update coordinates
...
Unfortunately, CD does not always converge to an optimal point for general nonsmooth $f$. However, consider problems of the form
$$f(x)=g(x)+\sum_{i=1}^n h_i(x_i),$$
where g(x) is convex and differentiable, and each hi(xi) is convex, possibly nonsmooth.
[3] shows that with very mild regularity conditions, the CD iterates x(k) converge to an optimal point of f(x).
The result holds for block CD and arbitrary order of cycle through coordinates.
There exist many convergence results for different variants of CD based on different assumptions on f(x). Below is one representative result with weak assumptions.
Assume that $g(x)$ is coordinate-wise $L_i$-smooth, and define
$$R_0=\max_y\max_{x^*\in X^*}\left\{\left(\sum_{i=1}^n L_i(y_i-x_i^*)^2\right)^{1/2}:f(y)\le f(x^{(0)})\right\},$$
where $X^*$ denotes the optimal set.
Consider the randomized CD algorithm. Fix a confidence parameter $0<\rho<1$ and an accuracy level $\varepsilon<\min\{R_0^2,\,f(x^{(0)})-f(x^*)\}$. Then with
$$k\ge\frac{2nR_0^2}{\varepsilon}\log\frac{f(x^{(0)})-f(x^*)}{\varepsilon\rho},$$
we have $f(x^{(k)})-f(x^*)\le\varepsilon$ with probability at least $1-\rho$.
For simplicity suppose $L_i\equiv L$, and assume that $g(x)$ is also strongly convex with parameter $m>0$. Fix a confidence parameter $0<\rho<1$ and an accuracy level $\varepsilon>0$. If
$$k\ge\frac{4Ln}{m}\log\frac{f(x^{(0)})-f(x^*)}{\varepsilon\rho},$$
then $f(x^{(k)})-f(x^*)\le\varepsilon$ with probability at least $1-\rho$.
We use the lasso model
$$\min_\beta\ \ell(\beta):=\min_\beta\ \frac{1}{2n}\|y-X\beta\|^2+\lambda\|\beta\|_1$$
to study the convergence properties and computational efficiency of several algorithms we have covered in previous lectures.
Derive the update rules for several algorithms
Study the change of loss function versus iteration number
Study the change of loss function versus running time
The design matrix is a simulated sparse matrix, but for now we use its dense version.
library(Matrix)
set.seed(123)
n = 500
p = 1000
s = 30
xsp = rsparsematrix(n, p, density = 0.1)
x = as.matrix(xsp)
beta = c(rnorm(s), numeric(p - s))
y = as.numeric(xsp %*% beta) + 0.1 * rnorm(n)
lam = 0.001

objfn = function(beta, x, y, lam)
{
    0.5 * mean((y - as.numeric(x %*% beta))^2) + lam * sum(abs(beta))
}
Use the celebrated glmnet package to compute the optimal value.
library(glmnet)
res = glmnet(x, y, lambda = lam, thresh = 1e-15, standardize = FALSE, intercept = FALSE)
beta_opt = as.numeric(res$beta)
head(beta_opt)
## [1] -0.3862279 1.4304118 -0.2206914 1.1256892 0.1704705 0.8657187
tail(beta_opt)
## [1] 0.000000000 -0.003938278 0.000000000 0.000000000 0.000000000
## [6] 0.008219107
obj_opt = objfn(beta_opt, x, y, lam)
obj_opt
## [1] 0.02782515
A general interface for running optimization algorithms.
run_opt = function(x, y, lam, init_state, update_rule, maxit = 1001, ...)
{
    ...
}
init_state() and update_rule() are two function arguments that characterize the optimization algorithm
init_state(x, y, lam, ...) is used to create the initial state of the algorithm
update_rule(beta, iter, x, y, lam, state) returns the updated $\beta$ and the state, as list(beta, state)
The full implementation of run_opt():
run_opt = function(x, y, lam, init_state, update_rule, maxit = 1001, ...)
{
    p = ncol(x)
    beta_hat = numeric(p)
    losses = c()
    # Initialization
    state = init_state(x, y, lam, ...)
    # Main loop
    for(i in 1:maxit)
    {
        res = update_rule(beta_hat, i - 1, x, y, lam, state)
        beta_hat = res$beta
        state = res$state
        loss = objfn(beta_hat, x, y, lam)
        losses = c(losses, loss)
        if((i - 1) %% 100 == 0)
            cat(sprintf("Iter %d, loss = %.6f\n", i - 1, loss))
    }
    list(beta_hat = beta_hat, losses = losses)
}
Choose the subgradient $\tilde\nabla\ell(\beta)=(1/n)X'(X\beta-y)+\lambda\cdot\operatorname{sign}(\beta)$
Update rule: $\beta^{(k+1)}=\beta^{(k)}-\alpha_k\tilde\nabla\ell(\beta^{(k)})$
Step size $\alpha_k=\alpha_0/\sqrt{k+1}$
init_subgrad = function(x, y, lam, alpha0 = 1)
{
    list(alpha0 = alpha0)
}

update_subgrad = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha0 = state$alpha0
    alphak = alpha0 / sqrt(k + 1)
    Xtr = crossprod(x, as.numeric(x %*% beta) - y)
    subgrad = as.numeric(Xtr) / n + lam * sign(beta)
    new_beta = beta - alphak * subgrad
    list(beta = new_beta, state = state)
}
res_subgrad1 = run_opt(x, y, lam, init_subgrad, update_subgrad, alpha0 = 1)

## Iter 0, loss = 0.770908
## Iter 100, loss = 0.084973
## Iter 200, loss = 0.074552
## Iter 300, loss = 0.070417
## Iter 400, loss = 0.067906
## Iter 500, loss = 0.066074
## Iter 600, loss = 0.064626
## Iter 700, loss = 0.063428
## Iter 800, loss = 0.062394
## Iter 900, loss = 0.061477
## Iter 1000, loss = 0.060661
head(res_subgrad1$beta)
## [1] -2.174641e-01 9.316010e-01 -1.026982e-01 6.878443e-01 1.129936e-05
## [6] 2.784390e-01

tail(res_subgrad1$beta)

## [1] -3.352589e-02 -5.879592e-02 -1.432275e-06 -3.051985e-02 5.442229e-02
## [6] 1.141592e-01
res_subgrad10 = run_opt(x, y, lam, init_subgrad, update_subgrad, alpha0 = 10)

## Iter 0, loss = 8.881744
## Iter 100, loss = 0.049232
## Iter 200, loss = 0.040807
## Iter 300, loss = 0.036166
## Iter 400, loss = 0.032993
## Iter 500, loss = 0.030828
## Iter 600, loss = 0.029429
## Iter 700, loss = 0.028636
## Iter 800, loss = 0.028227
## Iter 900, loss = 0.028043
## Iter 1000, loss = 0.027961
head(res_subgrad10$beta)
## [1] -0.3913635 1.4284850 -0.2179803 1.1101840 0.1683580 0.8471952
tail(res_subgrad10$beta)
## [1] -2.979032e-04 -3.842055e-03 2.795830e-04 -9.758347e-05 -1.092878e-04
## [6] 1.166201e-02
Update rule: $\beta^{(k+1)}=S_{\alpha\lambda}\big(\beta^{(k)}-(\alpha/n)X'(X\beta^{(k)}-y)\big)$
Step size $\alpha=1/\lambda_{\max}(n^{-1}X'X)=n/\lambda_{\max}(X'X)$
library(RSpectra)

# Soft-thresholding function
soft_thresh = function(x, a)
{
    sign(x) * pmax(0, abs(x) - a)
}

init_ista = function(x, y, lam)
{
    # Compute the largest eigenvalue of X'X
    alpha = nrow(x) / svds(x, k = 1, nu = 0, nv = 0)$d^2
    list(alpha = alpha)
}

update_ista = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha = state$alpha
    Xtr = crossprod(x, as.numeric(x %*% beta) - y)
    grad = as.numeric(Xtr) / n
    new_beta = soft_thresh(beta - alpha * grad, alpha * lam)
    list(beta = new_beta, state = state)
}
res_ista = run_opt(x, y, lam, init_ista, update_ista)
## Iter 0, loss = 0.494847
## Iter 100, loss = 0.044493
## Iter 200, loss = 0.032321
## Iter 300, loss = 0.028040
## Iter 400, loss = 0.027830
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_ista$beta)
## [1] -0.3862280 1.4304114 -0.2206905 1.1256887 0.1704708 0.8657158
tail(res_ista$beta)
## [1] 0.000000000 -0.003938134 0.000000000 0.000000000 0.000000000
## [6] 0.008219716
$$\begin{aligned}
u^{(k+1)} &= \beta^{(k)}+\frac{k-1}{k+2}\big(\beta^{(k)}-\beta^{(k-1)}\big)\\
\beta^{(k+1)} &= S_{\alpha\lambda}\big(u^{(k+1)}-(\alpha/n)X'(Xu^{(k+1)}-y)\big)
\end{aligned}$$
init_fista = function(x, y, lam)
{
    p = ncol(x)
    alpha = nrow(x) / svds(x, k = 1, nu = 0, nv = 0)$d^2
    list(alpha = alpha, old_beta = numeric(p), betau = numeric(p))
}

update_fista = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha = state$alpha
    state$betau = if(k <= 1) beta else beta + (k - 1) / (k + 2) * (beta - state$old_beta)
    Xtr = crossprod(x, as.numeric(x %*% state$betau) - y)
    grad = as.numeric(Xtr) / n
    new_beta = soft_thresh(state$betau - alpha * grad, alpha * lam)
    state$old_beta = beta
    list(beta = new_beta, state = state)
}
res_fista = run_opt(x, y, lam, init_fista, update_fista)
## Iter 0, loss = 0.494847
## Iter 100, loss = 0.027830
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_fista$beta)
## [1] -0.3862278 1.4304116 -0.2206913 1.1256895 0.1704706 0.8657180
tail(res_fista$beta)
## [1] 0.000000000 -0.003938369 0.000000000 0.000000000 0.000000000
## [6] 0.008219289
Let $g(\beta)=\lambda\|\beta\|_1$ and $h(\beta)=\frac{1}{2n}\|y-X\beta\|^2$. Then
$$\operatorname{prox}_{\alpha g}(z)=S_{\alpha\lambda}(z),\qquad
\operatorname{prox}_{\alpha h}(\beta)=\big(I+(\alpha/n)X'X\big)^{-1}\big(\beta+(\alpha/n)X'y\big)$$
Update rule:
$$\begin{aligned}
\beta^{(k+1)} &= S_{\alpha\lambda}\big(u^{(k)}\big)\\
u^{(k+1)} &= u^{(k)}+A^{-1}\big(2\beta^{(k+1)}-u^{(k)}+(\alpha/n)X'y\big)-\beta^{(k+1)},
\end{aligned}\qquad\text{where } A=I+(\alpha/n)X'X$$
chol_solve = function(R, b)
{
    y = backsolve(R, b, transpose = TRUE)  # R'*y = b
    x = backsolve(R, y)                    # R*x = y
    x
}

init_drs = function(x, y, lam, alpha = 1)
{
    n = nrow(x)
    p = ncol(x)
    scaled_xty = alpha / n * as.numeric(crossprod(x, y))
    R = chol(diag(p) + alpha / n * crossprod(x))
    list(alpha = alpha, sxty = scaled_xty, R = as.matrix(R), betau = numeric(p))
}

update_drs = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha = state$alpha
    new_beta = soft_thresh(state$betau, alpha * lam)
    rhs = 2 * new_beta - state$betau + state$sxty
    state$betau = state$betau + chol_solve(state$R, rhs) - new_beta
    list(beta = new_beta, state = state)
}
res_drs1 = run_opt(x, y, lam, init_drs, update_drs, alpha = 1)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.053128
## Iter 200, loss = 0.041675
## Iter 300, loss = 0.034461
## Iter 400, loss = 0.029864
## Iter 500, loss = 0.028097
## Iter 600, loss = 0.027848
## Iter 700, loss = 0.027828
## Iter 800, loss = 0.027826
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_drs1$beta)
## [1] -0.3863738 1.4304559 -0.2205224 1.1254740 0.1704870 0.8652928
tail(res_drs1$beta)
## [1] 0.000000000 -0.003902763 0.000000000 0.000000000 0.000000000
## [6] 0.008261868
res_drs10 = run_opt(x, y, lam, init_drs, update_drs, alpha = 10)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.027825
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_drs10$beta)
## [1] -0.3862278 1.4304117 -0.2206913 1.1256895 0.1704706 0.8657182
tail(res_drs10$beta)
## [1] 0.000000000 -0.003938377 0.000000000 0.000000000 0.000000000
## [6] 0.008219268
Let $f(\beta)=\lambda\|\beta\|_1$, $g(\beta)=\frac{1}{2n}\|y-X\beta\|^2$. Then
Update rule:
$$\begin{aligned}
\beta^{(k+1)} &= S_{\lambda/\rho}\big(z^{(k)}-u^{(k)}\big)\\
z^{(k+1)} &= A^{-1}\big(\beta^{(k+1)}+u^{(k)}+(n\rho)^{-1}X'y\big)\\
u^{(k+1)} &= u^{(k)}+\beta^{(k+1)}-z^{(k+1)},
\end{aligned}\qquad\text{where } A=I+(n\rho)^{-1}X'X$$
init_admm = function(x, y, lam, rho = 1)
{
    n = nrow(x)
    p = ncol(x)
    scaled_xty = as.numeric(crossprod(x, y)) / n / rho
    R = chol(diag(p) + crossprod(x) / n / rho)
    list(rho = rho, sxty = scaled_xty, R = as.matrix(R),
         betaz = numeric(p), betau = numeric(p))
}

update_admm = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    rho = state$rho
    new_beta = soft_thresh(state$betaz - state$betau, lam / rho)
    rhs = new_beta + state$betau + state$sxty
    state$betaz = chol_solve(state$R, rhs)
    state$betau = state$betau + new_beta - state$betaz
    list(beta = new_beta, state = state)
}
res_admm1 = run_opt(x, y, lam, init_admm, update_admm, rho = 1)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.053098
## Iter 200, loss = 0.041664
## Iter 300, loss = 0.034453
## Iter 400, loss = 0.029859
## Iter 500, loss = 0.028096
## Iter 600, loss = 0.027848
## Iter 700, loss = 0.027828
## Iter 800, loss = 0.027826
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_admm1$beta)
## [1] -0.3863737 1.4304559 -0.2205226 1.1254743 0.1704871 0.8652934
tail(res_admm1$beta)
## [1] 0.000000000 -0.003902788 0.000000000 0.000000000 0.000000000
## [6] 0.008261821
res_admm0.1 = run_opt(x, y, lam, init_admm, update_admm, rho = 0.1)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.027825
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_admm0.1$beta)
## [1] -0.3862278 1.4304117 -0.2206913 1.1256895 0.1704706 0.8657182
tail(res_admm0.1$beta)
## [1] 0.000000000 -0.003938377 0.000000000 0.000000000 0.000000000
## [6] 0.008219268
Each iteration contains a full cycle of $\beta_i$ updates
Update rule:
$$\beta_i\leftarrow S_{n\lambda/\|X_i\|^2}\!\left(\frac{X_i'(y-X_{-i}\beta_{-i})}{\|X_i\|^2}\right)$$
Equivalently, in terms of the full residual $r=y-X\beta$:
$$\beta_i\leftarrow S_{n\lambda/\|X_i\|^2}\!\left(\beta_i+\frac{X_i'r}{\|X_i\|^2}\right)$$
# Soft-thresholding function
soft_thresh = function(x, a)
{
    sign(x) * pmax(0, abs(x) - a)
}

init_cd = function(x, y, lam)
{
    xi_square = colSums(x^2)
    thresh = nrow(x) * lam / xi_square
    list(xi_square = xi_square, thresh = thresh)
}

update_cd = function(beta, k, x, y, lam, state)
{
    new_beta = beta
    # betai <- S(betai + xi' * r / ||xi||^2, t), r = y - x * beta
    r = y - c(x %*% beta)
    for(i in seq_along(new_beta))
    {
        old = new_beta[i]
        newb = old + sum(x[, i] * r) / state$xi_square[i]
        thresh = state$thresh[i]
        new_beta[i] = soft_thresh(newb, thresh)
        r = r - (new_beta[i] - old) * x[, i]
    }
    list(beta = new_beta, state = state)
}
res_cd = run_opt(x, y, lam, init_cd, update_cd)
## Iter 0, loss = 0.046708
## Iter 100, loss = 0.027825
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_cd$beta)
## [1] -0.3862278 1.4304117 -0.2206913 1.1256895 0.1704706 0.8657182
tail(res_cd$beta)
## [1] 0.000000000 -0.003938377 0.000000000 0.000000000 0.000000000
## [6] 0.008219268
K = 1000
gdat = data.frame(
    iter = rep(1:K, times = 8),
    loss = c(res_subgrad1$losses[1:K], res_subgrad10$losses[1:K],
             res_ista$losses[1:K], res_fista$losses[1:K],
             res_drs1$losses[1:K], res_drs10$losses[1:K],
             res_admm1$losses[1:K], res_cd$losses[1:K]),
    method = rep(c("Subgrad α=1", "Subgrad α=10", "ISTA", "FISTA",
                   "DRS α=1", "DRS α=10", "ADMM ρ=1", "CD"), each = K)
)
library(ggplot2)
ggplot(gdat, aes(x = iter, y = log10(loss - obj_opt))) +
    geom_line(aes(color = method)) +
    scale_color_hue("Method") +
    xlab("Iteration") + ylab("Log-error") +
    theme_bw()
The previous comparison is neither rigorous nor complete, and more work needs to be done:
The plot only shows error vs. iteration, but not error vs. computing time (a rough timing sketch is given after this list)
The direct factorization of $A=I+(\alpha/n)X'X$ is not optimal (recall that $p>n$)
Design of algorithms when $X$ is sparse (i.e., operating on the xsp object)
Left as after-class exercises.
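As a starting point for the first item, one rough (far from rigorous) way to compare running time is to wrap run_opt() with system.time(), reusing the functions defined above:

# system.time() wraps a full call to run_opt(); dividing by maxit = 1001 gives a
# rough per-iteration cost (the printing overhead inside run_opt() is included)
t_ista = as.numeric(system.time(run_opt(x, y, lam, init_ista, update_ista))["elapsed"])
t_cd   = as.numeric(system.time(run_opt(x, y, lam, init_cd, update_cd))["elapsed"])
c(ISTA_per_iter = t_ista / 1001, CD_per_iter = t_cd / 1001)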
[1] Stephen Boyd et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
[2] Wei Deng et al. (2017). Parallel multi-block ADMM with o(1/k) convergence. Journal of Scientific Computing.
[3] Paul Tseng (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications.
[4] Peter Richtárik and Martin Takáč (2014). Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming.