
Computational Statistics

Lecture 9

Yixuan Qiu

2023-11-08

1 / 56

Optimization

2 / 56

Last Time

  • We have spent much time on nonsmooth problems

  • Since they are very common in statistical models

  • Meanwhile they are considered "hard problems" in convex optimization

  • So far, the main tool to overcome nonsmoothness is the proximal operator

  • Seen in proximal gradient descent, Douglas-Rachford splitting, Davis-Yin splitting, etc.

  • In this lecture we introduce a few other frameworks

3 / 56

Today's Topics

  • Alternating direction method of multipliers

  • Coordinate descent

  • Case studies

4 / 56

Alternating Direction Method of Multipliers

5 / 56

Problem

Another popular framework to deal with nonsmooth or constrained problems is the alternating direction method of multipliers (ADMM). It considers problems of the form:

$$\min_{x,z}\ F(x,z):=f(x)+g(z)\quad\text{subject to}\quad Ax+Bz=c$$

  • $x\in\mathbb{R}^n$, $z\in\mathbb{R}^m$, $A\in\mathbb{R}^{p\times n}$, $B\in\mathbb{R}^{p\times m}$, and $c\in\mathbb{R}^p$

  • $f(x)$ and $g(z)$ are convex functions, possibly nonsmooth

  • $Ax+Bz=c$ represents general linear equality constraints

6 / 56

Examples

  • Although this problem form is a bit exotic compared with those we have seen in previous lectures

  • Many familiar problems can be converted into this form by cleverly defining the x and z variables

  • Lasso:

$$\min_x\ \tfrac{1}{2}\|b-Ax\|^2+\lambda\|x\|_1$$

  • $f(x)=\tfrac{1}{2}\|b-Ax\|^2$, $g(z)=\lambda\|z\|_1$

  • Linear constraint $x-z=0$ (the converted problem is spelled out below)
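
Spelled out, the converted problem (using the $f$, $g$, and constraint above) is

$$\min_{x,z}\ \tfrac{1}{2}\|b-Ax\|^2+\lambda\|z\|_1\quad\text{subject to}\quad x-z=0$$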

7 / 56

Examples

  • Least absolute deviations:

$$\min_x\ \|b-Ax\|_1$$

  • $f(x)=0$, $g(z)=\|z\|_1$

  • Linear constraint $Ax-z=b$ (the converted problem is spelled out below)
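
Spelled out, the converted problem is

$$\min_{x,z}\ \|z\|_1\quad\text{subject to}\quad Ax-z=b$$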

8 / 56

Algorithm

ADMM can be expressed as the following algorithm. Given initial values $z^{(0)}$ and $u^{(0)}$, for $k=0,1,\ldots$ iterate

$$\begin{aligned}
x^{(k+1)} &= \arg\min_x\ f(x)+(\rho/2)\|Ax+Bz^{(k)}-c+u^{(k)}\|^2\\
z^{(k+1)} &= \arg\min_z\ g(z)+(\rho/2)\|Ax^{(k+1)}+Bz-c+u^{(k)}\|^2\\
u^{(k+1)} &= u^{(k)}+Ax^{(k+1)}+Bz^{(k+1)}-c
\end{aligned}$$

$\rho>0$ is an arbitrary "inverse step size" parameter. A sketch of where these updates come from is given below.
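
These updates can be read off from the scaled augmented Lagrangian (see [1] for details):

$$L_\rho(x,z,u)=f(x)+g(z)+\frac{\rho}{2}\|Ax+Bz-c+u\|^2-\frac{\rho}{2}\|u\|^2$$

The $x$- and $z$-updates minimize $L_\rho$ over $x$ and $z$ in turn, and the $u$-update is the corresponding (scaled) dual ascent step.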

9 / 56

Some Remarks

  • The x- and z-updates look similar to proximal operators

  • However, here we have more general quadratic terms

  • This means that the x- and z-updates are typically more difficult than proximal operators

  • On the other hand, ADMM does not need to explicitly project to the linear constraint set

  • These findings largely guide how to define the f and g functions

10 / 56

Relation with DRS

Consider using ADMM to solve the problem

$$\min_x\ f(x)+g(x)\quad\Longleftrightarrow\quad\min_{x,z}\ f(x)+g(z)\ \ \text{subject to}\ \ x-z=0$$

Then the iteration becomes

$$\begin{aligned}
x^{(k+1)} &= \operatorname{prox}_{(1/\rho)f}\left(z^{(k)}-u^{(k)}\right)\\
z^{(k+1)} &= \operatorname{prox}_{(1/\rho)g}\left(x^{(k+1)}+u^{(k)}\right)\\
u^{(k+1)} &= u^{(k)}+x^{(k+1)}-z^{(k+1)}
\end{aligned}$$

This is equivalent to DRS, by slightly reparameterizing the variables.

11 / 56

Implementation Details

  • Many implementation details can be found in the main reference [1]

  • Stopping criterion

  • Selection of the ρ parameter

  • Closed-form expressions for specific problems

12 / 56

Convergence Property [2]

The sequence $(x^{(k)},z^{(k)},u^{(k)})$ satisfies

$$\|A(x^{(k+1)}-x^{(k)})\|^2+\|B(z^{(k+1)}-z^{(k)})\|^2+\|u^{(k+1)}-u^{(k)}\|^2=o(1/k).$$

This is one of the most general convergence results on ADMM.

Many faster rates exist under stronger conditions on $f(x)$ and $g(z)$, e.g., smoothness or strong convexity.

13 / 56

Coordinate Descent

14 / 56

Motivation

  • One more idea to overcome nonsmoothness is to directly find the optimal point of nonsmooth problems

  • Of course, this is unrealistic for general problems

  • But in many cases this is possible for univariate functions

15 / 56

Example

Lasso model

$$\min_\beta\ \ell(\beta):=\min_\beta\ \frac{1}{2n}\|y-X\beta\|^2+\lambda\|\beta\|_1$$

  • Solving for the whole $\beta$ vector is hard

  • However, assume we fix all elements of $\beta$ except $\beta_i$

  • Then we can actually derive the optimal $\beta_i$ in closed form

16 / 56

Example

  • Recall that in Lecture 7 we have shown

$$f(x^*)=\min_x f(x)\quad\Longleftrightarrow\quad 0\in\partial f(x^*)$$

  • View $\tilde\ell(\beta_i)\equiv\ell(\beta)$ as a function of $\beta_i$; then

$$\partial\tilde\ell(\beta_i)=n^{-1}\|X_i\|^2\beta_i+n^{-1}X_i^\top(X_{-i}\beta_{-i}-y)+\lambda\,\partial|\beta_i|,$$

  • $X_i$ is the $i$-th column of $X$

  • $X_{-i}$ is the matrix formed by removing $X_i$ from $X$

  • $\beta_{-i}$ is the vector formed by removing $\beta_i$ from $\beta$

17 / 56

Example

  • $0\in\partial\tilde\ell(\beta_i)$ gives the solution (a short derivation is given below)

$$\beta_i=S_{n\lambda/\|X_i\|^2}\left(\frac{X_i^\top(y-X_{-i}\beta_{-i})}{\|X_i\|^2}\right)$$
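
To see why, divide the optimality condition $0\in n^{-1}\|X_i\|^2\beta_i-n^{-1}X_i^\top(y-X_{-i}\beta_{-i})+\lambda\,\partial|\beta_i|$ by $n^{-1}\|X_i\|^2$: this shows that $\beta_i$ solves the univariate problem

$$\min_{\beta_i}\ \tfrac{1}{2}(\beta_i-c_i)^2+\frac{n\lambda}{\|X_i\|^2}|\beta_i|,\qquad c_i=\frac{X_i^\top(y-X_{-i}\beta_{-i})}{\|X_i\|^2},$$

whose solution is exactly the soft-thresholding operator $S_{n\lambda/\|X_i\|^2}$ applied to $c_i$.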

18 / 56

Coordinate Descent

These findings motivate the coordinate descent (CD) algorithm. Consider the problem

$$\min_x\ f(x):=\min_x\ f(x_1,\ldots,x_n)$$

One common type of CD algorithm proceeds as follows. Given initial value $x^{(0)}$, for $k=0,1,\ldots$, iterate

$$x_i^{(k+1)}=\arg\min_{x_i}\ f\left(x_1^{(k+1)},\ldots,x_{i-1}^{(k+1)},x_i,x_{i+1}^{(k)},\ldots,x_n^{(k)}\right)$$

for $i=1,\ldots,n$ within each outer iteration $k$.

This is typically called the cyclic CD algorithm. A minimal R sketch is given below.
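
Below is a minimal sketch of the cyclic scheme (not taken from the slides). It assumes a generic objective f and solves each univariate subproblem numerically with optimize(); in practice one would use a closed-form coordinate update whenever it is available, as in the lasso example above.

# Illustrative sketch: cyclic coordinate descent with a numerical 1-D solver
cyclic_cd = function(f, x0, maxit = 100, lower = -10, upper = 10)
{
    x = x0
    for(k in 1:maxit)
    {
        for(i in seq_along(x))
        {
            # Minimize f over coordinate i, holding the other coordinates fixed
            f1d = function(xi) { xx = x; xx[i] = xi; f(xx) }
            x[i] = optimize(f1d, lower = lower, upper = upper)$minimum
        }
    }
    x
}

# Example: a smooth quadratic whose coordinates interact through the cross term
f = function(x) (x[1] - 1)^2 + (x[2] - 2)^2 + 0.5 * x[1] * x[2]
cyclic_cd(f, c(0, 0))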

19 / 56

Variants

Another popular variant is to randomly select a coordinate to update. Given initial value $x^{(0)}$, for $k=0,1,\ldots$, do:

  1. Randomly select $i_k$ from $\{1,2,\ldots,n\}$ with equal probabilities

  2. Fix the other coordinates and update

$$\begin{aligned}
x_{i_k}^{(k+1)} &= \arg\min_{x_{i_k}}\ f\left(x_1^{(k)},\ldots,x_{i_k-1}^{(k)},x_{i_k},x_{i_k+1}^{(k)},\ldots,x_n^{(k)}\right)\\
x_j^{(k+1)} &= x_j^{(k)},\quad j\neq i_k
\end{aligned}$$

In the literature, this method is sometimes called randomized CD.

20 / 56

Variants

Many other variants exist:

  • Permutation CD: for each $k$, use a random permutation of $\{1,2,\ldots,n\}$ to define the order of coordinate updates

  • Block CD: each time update a block of coordinates simultaneously, instead of a single coordinate

  • Coordinate proximal gradient descent: use proximal operators to update coordinates

  • ...

21 / 56

Convergence Property [3]

Unfortunately, CD does not always converge to an optimal point for nonsmooth $f$. Instead, consider the problem

$$f(x)=g(x)+\sum_{i=1}^n h_i(x_i),$$

where $g(x)$ is convex and differentiable, and each $h_i(x_i)$ is convex, possibly nonsmooth.

[3] shows that under very mild regularity conditions, the CD iterates $x^{(k)}$ converge to an optimal point of $f(x)$.

The result holds for block CD and an arbitrary cyclic order through the coordinates.

22 / 56

Convergence Property [4]

There exist many convergence results for different variants of CD based on different assumptions on f(x). Below is one representative result with weak assumptions.

Assume that $g(x)$ is coordinate-wise $L_i$-smooth, and define

$$R_0=\max_{y}\,\max_{x^*\in X^*}\left\{\left(\sum_{i=1}^n L_i(y_i-x_i^*)^2\right)^{1/2}:\ f(y)\le f(x^{(0)})\right\},$$

where $X^*$ denotes the optimal set.

(See next slide)

23 / 56

Convergence Property [4]

Consider the randomized CD algorithm. Fix a confidence parameter $0<\rho<1$ and an accuracy level $\varepsilon<\min\{R_0^2,\,f(x^{(0)})-f(x^*)\}$. Then with

$$k\ge\frac{2nR_0^2}{\varepsilon}\log\frac{f(x^{(0)})-f(x^*)}{\varepsilon\rho},$$

we have $f(x^{(k)})-f(x^*)\le\varepsilon$ with probability at least $1-\rho$.

24 / 56

Convergence with Strong Convexity [4]

For simplicity suppose $L_i\equiv L$, and assume that $g(x)$ is also strongly convex with parameter $m>0$. Fix a confidence parameter $0<\rho<1$ and an accuracy level $\varepsilon>0$. If

$$k\ge\frac{4Ln}{m}\log\frac{f(x^{(0)})-f(x^*)}{\varepsilon\rho},$$

then $f(x^{(k)})-f(x^*)\le\varepsilon$ with probability at least $1-\rho$.

25 / 56

Case Study

26 / 56

Lasso

We use the lasso model

$$\min_\beta\ \ell(\beta):=\min_\beta\ \frac{1}{2n}\|y-X\beta\|^2+\lambda\|\beta\|_1$$

to study the convergence property and computational efficiency of several algorithms we have covered in previous lectures.

  • Derive the update rules for several algorithms

  • Study the change of loss function versus iteration number

  • Study the change of loss function versus running time

27 / 56

Simulate Data Set

The design matrix is a simulated sparse matrix, but for now we use its dense version.

library(Matrix)
set.seed(123)
n = 500
p = 1000
s = 30
xsp = rsparsematrix(n, p, density = 0.1)
x = as.matrix(xsp)
beta = c(rnorm(s), numeric(p - s))
y = as.numeric(xsp %*% beta) + 0.1 * rnorm(n)
lam = 0.001
objfn = function(beta, x, y, lam)
{
    0.5 * mean((y - as.numeric(x %*% beta))^2) + lam * sum(abs(beta))
}
28 / 56

Compute Optimal Value

Use the celebrated glmnet package to compute the optimal value.

library(glmnet)
res = glmnet(x, y, lambda = lam, thresh = 1e-15,
             standardize = FALSE, intercept = FALSE)
beta_opt = as.numeric(res$beta)
head(beta_opt)
## [1] -0.3862279 1.4304118 -0.2206914 1.1256892 0.1704705 0.8657187
tail(beta_opt)
## [1] 0.000000000 -0.003938278 0.000000000 0.000000000 0.000000000
## [6] 0.008219107
obj_opt = objfn(beta_opt, x, y, lam)
obj_opt
## [1] 0.02782515
29 / 56

Optimization

A general interface for running optimization algorithms.

run_opt = function(x, y, lam, init_state, update_rule, maxit = 1001, ...)
{
    ...
}
  • init_state() and update_rule() are two function arguments that characterize the optimization algorithm

  • init_state(x, y, lam, ...) creates the initial state of the algorithm

  • update_rule(beta, iter, x, y, lam, state) returns the updated β and state as list(beta, state)

30 / 56

Optimization

A general interface for running optimization algorithms.

run_opt = function(x, y, lam, init_state, update_rule, maxit = 1001, ...)
{
    p = ncol(x)
    beta_hat = numeric(p)
    losses = c()
    # Initialization
    state = init_state(x, y, lam, ...)
    # Main loop
    for(i in 1:maxit)
    {
        res = update_rule(beta_hat, i - 1, x, y, lam, state)
        beta_hat = res$beta
        state = res$state
        loss = objfn(beta_hat, x, y, lam)
        losses = c(losses, loss)
        if((i - 1) %% 100 == 0)
            cat(sprintf("Iter %d, loss = %.6f\n", i - 1, loss))
    }
    list(beta_hat = beta_hat, losses = losses)
}
31 / 56

Subgradient Method

  • Choose the subgradient $\tilde\nabla\ell(\beta)=(1/n)X^\top(X\beta-y)+\lambda\,\operatorname{sign}(\beta)$

  • Update rule: $\beta^{(k+1)}=\beta^{(k)}-\alpha_k\tilde\nabla\ell(\beta^{(k)})$

  • Step size $\alpha_k=\alpha_0/\sqrt{k+1}$

32 / 56

Implementation

init_subgrad = function(x, y, lam, alpha0 = 1)
{
    list(alpha0 = alpha0)
}

update_subgrad = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha0 = state$alpha0
    alphak = alpha0 / sqrt(k + 1)
    Xtr = crossprod(x, as.numeric(x %*% beta) - y)
    subgrad = as.numeric(Xtr) / n + lam * sign(beta)
    new_beta = beta - alphak * subgrad
    list(beta = new_beta, state = state)
}
33 / 56

Result

res_subgrad1 = run_opt(
x, y, lam, init_subgrad, update_subgrad, alpha0 = 1)
## Iter 0, loss = 0.770908
## Iter 100, loss = 0.084973
## Iter 200, loss = 0.074552
## Iter 300, loss = 0.070417
## Iter 400, loss = 0.067906
## Iter 500, loss = 0.066074
## Iter 600, loss = 0.064626
## Iter 700, loss = 0.063428
## Iter 800, loss = 0.062394
## Iter 900, loss = 0.061477
## Iter 1000, loss = 0.060661
head(res_subgrad1$beta)
## [1] -2.174641e-01 9.316010e-01 -1.026982e-01 6.878443e-01 1.129936e-05
## [6] 2.784390e-01
tail(res_subgrad1$beta)
## [1] -3.352589e-02 -5.879592e-02 -1.432275e-06 -3.051985e-02 5.442229e-02
## [6] 1.141592e-01
34 / 56

Result

res_subgrad10 = run_opt(
x, y, lam, init_subgrad, update_subgrad, alpha0 = 10)
## Iter 0, loss = 8.881744
## Iter 100, loss = 0.049232
## Iter 200, loss = 0.040807
## Iter 300, loss = 0.036166
## Iter 400, loss = 0.032993
## Iter 500, loss = 0.030828
## Iter 600, loss = 0.029429
## Iter 700, loss = 0.028636
## Iter 800, loss = 0.028227
## Iter 900, loss = 0.028043
## Iter 1000, loss = 0.027961
head(res_subgrad10$beta)
## [1] -0.3913635 1.4284850 -0.2179803 1.1101840 0.1683580 0.8471952
tail(res_subgrad10$beta)
## [1] -2.979032e-04 -3.842055e-03 2.795830e-04 -9.758347e-05 -1.092878e-04
## [6] 1.166201e-02
35 / 56

Proximal Gradient Descent

  • Update rule: $\beta^{(k+1)}=S_{\alpha\lambda}\left(\beta^{(k)}-(\alpha/n)X^\top(X\beta^{(k)}-y)\right)$

  • Step size $\alpha=1/\lambda_{\max}(n^{-1}X^\top X)=n/\lambda_{\max}(X^\top X)$ (see the note below)
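
This is the usual $1/L$ choice, where $L$ is the Lipschitz constant of the gradient of the smooth part $f(\beta)=\frac{1}{2n}\|y-X\beta\|^2$:

$$\nabla f(\beta)=\frac{1}{n}X^\top(X\beta-y),\qquad \|\nabla f(\beta_1)-\nabla f(\beta_2)\|\le\frac{\lambda_{\max}(X^\top X)}{n}\|\beta_1-\beta_2\|.$$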

36 / 56

Implementation

library(RSpectra)

# Soft-thresholding function
soft_thresh = function(x, a)
{
    sign(x) * pmax(0, abs(x) - a)
}

init_ista = function(x, y, lam)
{
    # Step size alpha = n / lambda_max(X'X), via the largest singular value of X
    alpha = nrow(x) / svds(x, k = 1, nu = 0, nv = 0)$d^2
    list(alpha = alpha)
}

update_ista = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha = state$alpha
    Xtr = crossprod(x, as.numeric(x %*% beta) - y)
    grad = as.numeric(Xtr) / n
    new_beta = soft_thresh(beta - alpha * grad, alpha * lam)
    list(beta = new_beta, state = state)
}
37 / 56

Result

res_ista = run_opt(x, y, lam, init_ista, update_ista)
## Iter 0, loss = 0.494847
## Iter 100, loss = 0.044493
## Iter 200, loss = 0.032321
## Iter 300, loss = 0.028040
## Iter 400, loss = 0.027830
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_ista$beta)
## [1] -0.3862280 1.4304114 -0.2206905 1.1256887 0.1704708 0.8657158
tail(res_ista$beta)
## [1] 0.000000000 -0.003938134 0.000000000 0.000000000 0.000000000
## [6] 0.008219716
38 / 56

Accelerated Proximal Gradient Descent

  • Update rule:

$$\begin{aligned}
u^{(k+1)} &= \beta^{(k)}+\frac{k-1}{k+2}\left(\beta^{(k)}-\beta^{(k-1)}\right)\\
\beta^{(k+1)} &= S_{\alpha\lambda}\left(u^{(k+1)}-(\alpha/n)X^\top(Xu^{(k+1)}-y)\right)
\end{aligned}$$

  • Step size $\alpha=1/\lambda_{\max}(n^{-1}X^\top X)=n/\lambda_{\max}(X^\top X)$
39 / 56

Implementation

init_fista = function(x, y, lam)
{
    p = ncol(x)
    alpha = nrow(x) / svds(x, k = 1, nu = 0, nv = 0)$d^2
    list(alpha = alpha, old_beta = numeric(p), betau = numeric(p))
}

update_fista = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha = state$alpha
    state$betau = if(k <= 1) beta else
        beta + (k - 1) / (k + 2) * (beta - state$old_beta)
    Xtr = crossprod(x, as.numeric(x %*% state$betau) - y)
    grad = as.numeric(Xtr) / n
    new_beta = soft_thresh(state$betau - alpha * grad, alpha * lam)
    state$old_beta = beta
    list(beta = new_beta, state = state)
}
40 / 56

Result

res_fista = run_opt(x, y, lam, init_fista, update_fista)
## Iter 0, loss = 0.494847
## Iter 100, loss = 0.027830
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_fista$beta)
## [1] -0.3862278 1.4304116 -0.2206913 1.1256895 0.1704706 0.8657180
tail(res_fista$beta)
## [1] 0.000000000 -0.003938369 0.000000000 0.000000000 0.000000000
## [6] 0.008219289
41 / 56

Douglas-Rachford Splitting

  • Let $g(\beta)=\lambda\|\beta\|_1$, $f(\beta)=\frac{1}{2n}\|y-X\beta\|^2$. Then

$$\begin{aligned}
\operatorname{prox}_{\alpha g}(z) &= S_{\alpha\lambda}(z)\\
\operatorname{prox}_{\alpha f}(\beta) &= \left(I+(\alpha/n)X^\top X\right)^{-1}\left(\beta+(\alpha/n)X^\top y\right)
\end{aligned}$$

  • Update rule:

$$\begin{aligned}
\beta^{(k+1)} &= S_{\alpha\lambda}\left(u^{(k)}\right)\\
u^{(k+1)} &= u^{(k)}+A^{-1}\left(2\beta^{(k+1)}-u^{(k)}+(\alpha/n)X^\top y\right)-\beta^{(k+1)}
\end{aligned}$$

where $A=I+(\alpha/n)X^\top X$. A quick derivation of the quadratic prox is given below.
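
As a quick check, the expression for $\operatorname{prox}_{\alpha f}$ follows from the first-order condition of its defining problem:

$$\operatorname{prox}_{\alpha f}(\beta)=\arg\min_u\ \frac{1}{2n}\|y-Xu\|^2+\frac{1}{2\alpha}\|u-\beta\|^2
\;\Longrightarrow\;
\frac{1}{n}X^\top(Xu-y)+\frac{1}{\alpha}(u-\beta)=0
\;\Longrightarrow\;
\left(I+\frac{\alpha}{n}X^\top X\right)u=\beta+\frac{\alpha}{n}X^\top y.$$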

42 / 56

Implementation

chol_solve = function(R, b)
{
    y = backsolve(R, b, transpose = TRUE)  # R'*y = b
    x = backsolve(R, y)                    # R*x = y
    x
}

init_drs = function(x, y, lam, alpha = 1)
{
    n = nrow(x)
    p = ncol(x)
    scaled_xty = alpha / n * as.numeric(crossprod(x, y))
    R = chol(diag(p) + alpha / n * crossprod(x))
    list(alpha = alpha, sxty = scaled_xty,
         R = as.matrix(R), betau = numeric(p))
}

update_drs = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    alpha = state$alpha
    new_beta = soft_thresh(state$betau, alpha * lam)
    rhs = 2 * new_beta - state$betau + state$sxty
    state$betau = state$betau + chol_solve(state$R, rhs) - new_beta
    list(beta = new_beta, state = state)
}
43 / 56

Result

res_drs1 = run_opt(x, y, lam, init_drs, update_drs, alpha = 1)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.053128
## Iter 200, loss = 0.041675
## Iter 300, loss = 0.034461
## Iter 400, loss = 0.029864
## Iter 500, loss = 0.028097
## Iter 600, loss = 0.027848
## Iter 700, loss = 0.027828
## Iter 800, loss = 0.027826
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_drs1$beta)
## [1] -0.3863738 1.4304559 -0.2205224 1.1254740 0.1704870 0.8652928
tail(res_drs1$beta)
## [1] 0.000000000 -0.003902763 0.000000000 0.000000000 0.000000000
## [6] 0.008261868
44 / 56

Result

res_drs10 = run_opt(x, y, lam, init_drs, update_drs, alpha = 10)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.027825
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_drs10$beta)
## [1] -0.3862278 1.4304117 -0.2206913 1.1256895 0.1704706 0.8657182
tail(res_drs10$beta)
## [1] 0.000000000 -0.003938377 0.000000000 0.000000000 0.000000000
## [6] 0.008219268
45 / 56

ADMM

  • Let $f(\beta)=\lambda\|\beta\|_1$ and $g(z)=\frac{1}{2n}\|y-Xz\|^2$, with the linear constraint $\beta-z=0$

  • Update rule:

$$\begin{aligned}
\beta^{(k+1)} &= S_{\lambda/\rho}\left(z^{(k)}-u^{(k)}\right)\\
z^{(k+1)} &= A^{-1}\left(\beta^{(k+1)}+u^{(k)}+(n\rho)^{-1}X^\top y\right)\\
u^{(k+1)} &= u^{(k)}+\beta^{(k+1)}-z^{(k+1)}
\end{aligned}$$

where $A=I+(n\rho)^{-1}X^\top X$

  • A reparameterized DRS
46 / 56

Implementation

init_admm = function(x, y, lam, rho = 1)
{
    n = nrow(x)
    p = ncol(x)
    scaled_xty = as.numeric(crossprod(x, y)) / n / rho
    R = chol(diag(p) + crossprod(x) / n / rho)
    list(rho = rho, sxty = scaled_xty, R = as.matrix(R),
         betaz = numeric(p), betau = numeric(p))
}

update_admm = function(beta, k, x, y, lam, state)
{
    n = nrow(x)
    rho = state$rho
    new_beta = soft_thresh(state$betaz - state$betau, lam / rho)
    rhs = new_beta + state$betau + state$sxty
    state$betaz = chol_solve(state$R, rhs)
    state$betau = state$betau + new_beta - state$betaz
    list(beta = new_beta, state = state)
}
47 / 56

Result

res_admm1 = run_opt(x, y, lam, init_admm, update_admm, rho = 1)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.053098
## Iter 200, loss = 0.041664
## Iter 300, loss = 0.034453
## Iter 400, loss = 0.029859
## Iter 500, loss = 0.028096
## Iter 600, loss = 0.027848
## Iter 700, loss = 0.027828
## Iter 800, loss = 0.027826
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_admm1$beta)
## [1] -0.3863737 1.4304559 -0.2205226 1.1254743 0.1704871 0.8652934
tail(res_admm1$beta)
## [1] 0.000000000 -0.003902788 0.000000000 0.000000000 0.000000000
## [6] 0.008261821
48 / 56

Result

res_admm0.1 = run_opt(x, y, lam, init_admm, update_admm, rho = 0.1)
## Iter 0, loss = 1.454710
## Iter 100, loss = 0.027825
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_admm0.1$beta)
## [1] -0.3862278 1.4304117 -0.2206913 1.1256895 0.1704706 0.8657182
tail(res_admm0.1$beta)
## [1] 0.000000000 -0.003938377 0.000000000 0.000000000 0.000000000
## [6] 0.008219268
49 / 56

Coordinate Descent

  • Each iteration contains a full cycle of $\beta_i$ updates

  • Update rule:

$$\beta_i\leftarrow S_{n\lambda/\|X_i\|^2}\left(\frac{X_i^\top(y-X_{-i}\beta_{-i})}{\|X_i\|^2}\right)$$

  • A "cleverer" scheme is to note that (since $X_i^\top(y-X_{-i}\beta_{-i})=X_i^\top r+\|X_i\|^2\beta_i$)

$$\beta_i\leftarrow S_{n\lambda/\|X_i\|^2}\left(\beta_i+\frac{X_i^\top r}{\|X_i\|^2}\right),\qquad r=y-X\beta$$

  • Then update $r\leftarrow r-(\beta_i^{\mathrm{new}}-\beta_i^{\mathrm{old}})X_i$, so each coordinate update costs $O(n)$ rather than $O(np)$ operations
50 / 56

Implementation

# Soft-thresholding function
soft_thresh = function(x, a)
{
    sign(x) * pmax(0, abs(x) - a)
}

init_cd = function(x, y, lam)
{
    xi_square = colSums(x^2)
    thresh = nrow(x) * lam / xi_square
    list(xi_square = xi_square, thresh = thresh)
}

update_cd = function(beta, k, x, y, lam, state)
{
    new_beta = beta
    # betai <- S(betai + xi' * r / ||xi||^2, t), r = y - x * beta
    r = y - c(x %*% beta)
    for(i in seq_along(new_beta))
    {
        old = new_beta[i]
        newb = old + sum(x[, i] * r) / state$xi_square[i]
        thresh = state$thresh[i]
        new_beta[i] = soft_thresh(newb, thresh)
        r = r - (new_beta[i] - old) * x[, i]
    }
    list(beta = new_beta, state = state)
}
51 / 56

Result

res_cd = run_opt(x, y, lam, init_cd, update_cd)
## Iter 0, loss = 0.046708
## Iter 100, loss = 0.027825
## Iter 200, loss = 0.027825
## Iter 300, loss = 0.027825
## Iter 400, loss = 0.027825
## Iter 500, loss = 0.027825
## Iter 600, loss = 0.027825
## Iter 700, loss = 0.027825
## Iter 800, loss = 0.027825
## Iter 900, loss = 0.027825
## Iter 1000, loss = 0.027825
head(res_cd$beta)
## [1] -0.3862278 1.4304117 -0.2206913 1.1256895 0.1704706 0.8657182
tail(res_cd$beta)
## [1] 0.000000000 -0.003938377 0.000000000 0.000000000 0.000000000
## [6] 0.008219268
52 / 56

Comparison

K = 1000
gdat = data.frame(
    iter = rep(1:K, times = 8),
    loss = c(res_subgrad1$losses[1:K], res_subgrad10$losses[1:K],
             res_ista$losses[1:K], res_fista$losses[1:K],
             res_drs1$losses[1:K], res_drs10$losses[1:K],
             res_admm1$losses[1:K], res_cd$losses[1:K]),
    method = rep(c("Subgrad α=1", "Subgrad α=10", "ISTA", "FISTA",
                   "DRS α=1", "DRS α=10", "ADMM ρ=1", "CD"), each = K)
)
53 / 56

Comparison

library(ggplot2)
ggplot(gdat, aes(x = iter, y = log10(loss - obj_opt))) +
    geom_line(aes(color = method)) +
    scale_color_hue("Method") + xlab("Iteration") + ylab("Log-error") +
    theme_bw()

54 / 56

More Considerations

The previous comparison is neither rigorous nor complete, and more work needs to be done:

  • The plot only shows error vs. iteration, but not error vs. computing time

  • The direct factorization of $A=I+(\alpha/n)X^\top X$ is not optimal (recall that $p>n$)

  • Design of algorithms when $X$ is sparse (i.e., operating on the xsp object)

Left as after-class exercises.

55 / 56

References

[1] Stephen Boyd et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning.

[2] Wei Deng et al. (2017). Parallel multi-block ADMM with o(1/k) convergence. Journal of Scientific Computing.

[3] Paul Tseng (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications.

[4] Peter Richtárik and Martin Takáč (2014). Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming.

56 / 56
