# Distributed Statistical Model Fitting with ADMM Algorithm

Yixuan Qiu & Rongrong Zhang
2015-12-02

@ STAT 598 Computing for Big Data

### Content

• What is ADMM?
• Distributed Computing with ADMM
• Implementation in Scala/Spark
• Computing on Real Data

## What is ADMM?

### What is ADMM?

• Full name: Alternating Direction Method of Multipliers
• An algorithm for constrained convex optimization problems
• Highlighted features:
• Provides easy-to-apply solutions to many statistical models
• Can be easily incorporated into the distributed computing framework
• A good friend of statistical models

### The Problem

• ADMM is used to solve the following optimization problem:

\begin{aligned} \text{minimize}\quad & f(\theta)+g(z)\\ \text{subject to}\quad & A\theta + Bz = c \end{aligned}

• $$\theta$$ and $$z$$ are vectors to be optimized
• $$A$$, $$B$$ and $$c$$ are known coefficient matrices/vectors
• $$f$$ and $$g$$ are convex functions

• Looks bizarre? Let's see two familiar examples

### Example — Lasso

\begin{aligned} \min & f(\theta)+g(z)\\ \text{s.t.} & A\theta + Bz = c \end{aligned}

• Solve $$\;\min_{\beta}\frac{1}{2n}\Vert y-X\beta\Vert_2^2+\lambda\Vert\beta\Vert_1$$
• $$f(\beta)=\frac{1}{2n}\Vert y-X\beta\Vert_2^2$$
• $$g(z)=\lambda\Vert z\Vert_1$$
• With constraint $$\;\beta-z=0$$

### Example — Median Regression (Least Absolute Deviations)

\begin{aligned} \min & f(\theta)+g(z)\\ \text{s.t.} & A\theta + Bz = c \end{aligned}

• Solve $$\;\min_{\beta} \Vert y-X\beta \Vert_1$$
• $$f(\beta)=0$$
• $$g(z)=\Vert z\Vert_1$$
• With constraint $$\;X\beta+z=y$$

### How ADMM Works?

\begin{align} \scriptstyle \theta^{k+1} & \scriptstyle :=\underset{\theta}{\arg\min}\left(f(\theta)+\frac{\rho}{2}\Vert A\theta+Bz^{k}-c+u^{k}/\rho\Vert_{2}^{2}\right)\\ \scriptstyle z^{k+1} & \scriptstyle :=\underset{z}{\arg\min}\left(g(z)+\frac{\rho}{2}\Vert A\theta^{k+1}+Bz-c+u^{k}/\rho\Vert_{2}^{2}\right)\\ \scriptstyle u^{k+1} & \scriptstyle :=u^{k}+\rho(A\theta^{k+1}+Bz^{k+1}-c) \end{align}

• $$\rho\,$$ can be any positive number
• Usually $$\scriptstyle \min_\theta f(\theta)+\frac{\rho}{2}\Vert A\theta+w\Vert_{2}^{2}$$ and $$\scriptstyle \min_z g(z)+\frac{\rho}{2}\Vert Bz+w\Vert_{2}^{2}$$ are easier to compute than the original problem

### Examples

• Lasso

\begin{align} \scriptstyle \beta^{k+1} & \scriptstyle :=(X'X+\rho I)^{-1}(X'y+\rho z^{k}-u^{k})\\ \scriptstyle z^{k+1} & \scriptstyle :=S_{n\lambda/\rho}(\beta^{k+1}+u^k/\rho)\\ \scriptstyle u^{k+1} & \scriptstyle :=u^{k}+\rho(\beta^{k+1}-z^{k+1}) \end{align}

• Median Regression

\begin{align} \scriptstyle \beta^{k+1} & \scriptstyle :=(X'X)^{-1}X'(y+z^{k}-u^{k}/\rho)\\ \scriptstyle z^{k+1} & \scriptstyle :=S_{1/\rho}(X\beta^{k+1}-y+u^{k}/\rho)\\ \scriptstyle u^{k+1} & \scriptstyle :=u^{k}+\rho(X\beta^{k+1}-z^{k+1}-y) \end{align}

• where $$\scriptstyle S_{\kappa}(a)=\max(0,1-\kappa/|a|)\cdot a$$

## How to Parallelize?

### Statistical Models

• Usually statistical models have the following form: $\min \sum_{i=1}^N f_i(\theta)+g(\theta)$
• Here $$\,f_i(\theta)$$ are partitions of the neg-log-likelihood function: If sample is cut into $$N$$ partitions, the log-likelihood can be written as the summation of $$N$$ parts
• $$g(\theta)$$ can be some penalty terms or prior information
• Models of this form can be easily parallelized

• The previous model can be rewritten as

\begin{aligned} \min & \sum_{i=1}^N f_i(\theta_i)+g(z)\\ \text{s.t.} & \theta_i - z = 0,\ i=1,\ldots,N \end{aligned}

• The iteration steps are

\begin{align} \scriptstyle \theta_i^{k+1} & \scriptstyle :=\underset{\theta_i}{\arg\min}\left(f_i(\theta_i)+\frac{\rho}{2}\Vert \theta_i-z^{k}+u_i^{k}/\rho\Vert_{2}^{2}\right)\\ \scriptstyle z^{k+1} & \scriptstyle :=\underset{z}{\arg\min}\left(g(z)+\frac{N\rho}{2}\Vert z-\bar{\theta}^{k+1}-\bar{u}^{k}/\rho\Vert_{2}^{2}\right)\\ \scriptstyle u_i^{k+1} & \scriptstyle :=u_i^{k}+\rho(\theta_i^{k+1}-z^{k+1}) \end{align}

• A diagram of P-ADMM

## Implementation in Scala/Spark

• Currently used to solve logistic lasso regression
• Can be easily extended to general penalized likelihood model $\min_{\theta}\, l(\theta)+\lambda\Vert\theta\Vert_1$

### Abstract class

• class PADMML1
• For general L1-penalized likelihood model $\scriptstyle\min_{\theta}\, \sum_{i=1}^N l(\theta;x_i,y_i)+\lambda\Vert\theta\Vert_1$
• Implementing the following iteration algorithm \begin{align} \scriptstyle \theta_i^{k+1} & \scriptstyle :=\underset{\theta_i}{\arg\min}\left(l(\theta_i;x_i,y_i)+\frac{\rho}{2}\Vert \theta_i-z^{k}+u_i^{k}/\rho\Vert_{2}^{2}\right)\\ \scriptstyle z^{k+1} & \scriptstyle :=S_{\lambda/(\rho N)}(\bar{\theta}^{k+1}+\bar{u}^{k+1}/\rho)\\ \scriptstyle u_i^{k+1} & \scriptstyle :=u_i^{k}+\rho(\theta_i^{k+1}-z^{k+1}) \end{align}
• $$\theta$$-step is undefined (abstract)

### Derived class

• class PLogisticLasso extends PADMML1
• Solves sparse logistic regression
• Only needs to implement the function for $$\theta$$-step
• Calls the solver of logistic ridge regression

### Solver class

• class LogisticRidge
• class LogisticRidgeNative
• Solves the problem $\scriptstyle\min_{\theta}\, l(\theta)+\lambda\Vert\theta-v\Vert_2^2$
• LogisticRidge: pure Java code
• LogisticRidgeNative: using C++ code to speed up

### Example Usage

import statr.stat598bd.PLogisticLasso

// Create RDD objects x and y representing
// the data matrix and response vector
// ...

val model = new PLogisticLasso(x, y, sc)
model.set_lambda(2.0)
model.set_opts(max_iter = 500,
eps_abs = 1e-3,
eps_rel = 1e-3,
logs = true)
model.run()

val beta = model.coef


## Real Data Computing

• 1.7 million rows (text blocks) / 145 variables

### Text Classification

• Classify each text block to one or more labels (33 in total), such as date, invoice number or monetary amount
• We consider the 33rd label

### Features

• Numbers: Continuous/discrete numerical values
• Boolean: The values include YES (true) or NO (false)
• Encrypted text

### Overall Flow

• Preprocess the data by removing text and missing values, and encoding YES/NO values to 1/0
• Store the cleaned data to HDFS
• Read data from Spark and randomly split into training set (99%) and tuning set (1%)
• Standardize the training set
• Consider 10 $$\lambda$$ values ranging from $$0.001n$$ to $$0.1n$$, equally spaced in the log scale

### Overall Flow - Cont.

• Fit a logistic lasso regression for each $$\lambda$$, and calculate the log-likelihood on tuning set
• Select the “optimal” $$\lambda$$
• Obtain the final regression coefficient based on the selected $$\lambda$$

### Timings

• With 10 / 20 Executors
• Cleaning data: 41.3 s / 41.3 s
• Reading data in Spark: 23.6 s / 17.6 s
• Data standardization: 7.9 s / 7.7 s
• Setting up model: 9.4 s / 6.6 s
• Model fitting:

### Timings

• Coding time: FOREVER
“Multithreading: Theory and Practice”