Distributed Statistical Model Fitting with ADMM Algorithm

Yixuan Qiu & Rongrong Zhang
2015-12-02

@ STAT 598 Computing for Big Data

Content

  • What is ADMM?
  • Distributed Computing with ADMM
  • Implementation in Scala/Spark
  • Computing on Real Data

What is ADMM?

What is ADMM?

  • Full name: Alternating Direction Method of Multipliers
  • An algorithm for constrained convex optimization problems
  • Highlighted features:
    • Provides easy-to-apply solutions to many statistical models
    • Can be easily incorporated into the distributed computing framework
    • A good friend of statistical models

The Problem

  • ADMM is used to solve the following optimization problem:

\[ \begin{aligned} \text{minimize}\quad & f(\theta)+g(z)\\ \text{subject to}\quad & A\theta + Bz = c \end{aligned} \]

  • \( \theta \) and \( z \) are vectors to be optimized
  • \( A \), \( B \) and \( c \) are known coefficient matrices/vectors
  • \( f \) and \( g \) are convex functions

  • Looks bizarre? Let's see two familiar examples

Example — Lasso

\[ \begin{aligned} \min & f(\theta)+g(z)\\ \text{s.t.} & A\theta + Bz = c \end{aligned} \]

  • Solve \( \;\min_{\beta}\frac{1}{2n}\Vert y-X\beta\Vert_2^2+\lambda\Vert\beta\Vert_1 \)
  • \( f(\beta)=\frac{1}{2n}\Vert y-X\beta\Vert_2^2 \)
  • \( g(z)=\lambda\Vert z\Vert_1 \)
  • With constraint \( \;\beta-z=0 \)

Example — Median Regression (Least Absolute Deviations)

\[ \begin{aligned} \min & f(\theta)+g(z)\\ \text{s.t.} & A\theta + Bz = c \end{aligned} \]

  • Solve \( \;\min_{\beta} \Vert y-X\beta \Vert_1 \)
  • \( f(\beta)=0 \)
  • \( g(z)=\Vert z\Vert_1 \)
  • With constraint \( \;X\beta+z=y \)

How ADMM Works?

\[ \begin{align} \scriptstyle \theta^{k+1} & \scriptstyle :=\underset{\theta}{\arg\min}\left(f(\theta)+\frac{\rho}{2}\Vert A\theta+Bz^{k}-c+u^{k}/\rho\Vert_{2}^{2}\right)\\ \scriptstyle z^{k+1} & \scriptstyle :=\underset{z}{\arg\min}\left(g(z)+\frac{\rho}{2}\Vert A\theta^{k+1}+Bz-c+u^{k}/\rho\Vert_{2}^{2}\right)\\ \scriptstyle u^{k+1} & \scriptstyle :=u^{k}+\rho(A\theta^{k+1}+Bz^{k+1}-c) \end{align} \]

  • \( \rho\, \) can be any positive number
  • Usually \( \scriptstyle \min_\theta f(\theta)+\frac{\rho}{2}\Vert A\theta+w\Vert_{2}^{2} \) and \( \scriptstyle \min_z g(z)+\frac{\rho}{2}\Vert Bz+w\Vert_{2}^{2} \) are easier to compute than the original problem

Examples

  • Lasso

\[ \begin{align} \scriptstyle \beta^{k+1} & \scriptstyle :=(X'X+\rho I)^{-1}(X'y+\rho z^{k}-u^{k})\\ \scriptstyle z^{k+1} & \scriptstyle :=S_{n\lambda/\rho}(\beta^{k+1}+u^k/\rho)\\ \scriptstyle u^{k+1} & \scriptstyle :=u^{k}+\rho(\beta^{k+1}-z^{k+1}) \end{align} \]

  • Median Regression

\[ \begin{align} \scriptstyle \beta^{k+1} & \scriptstyle :=(X'X)^{-1}X'(y+z^{k}-u^{k}/\rho)\\ \scriptstyle z^{k+1} & \scriptstyle :=S_{1/\rho}(X\beta^{k+1}-y+u^{k}/\rho)\\ \scriptstyle u^{k+1} & \scriptstyle :=u^{k}+\rho(X\beta^{k+1}-z^{k+1}-y) \end{align} \]

  • where \( \scriptstyle S_{\kappa}(a)=\max(0,1-\kappa/|a|)\cdot a \)

How to Parallelize?

Statistical Models

  • Usually statistical models have the following form: \[ \min \sum_{i=1}^N f_i(\theta)+g(\theta) \]
  • Here \( \,f_i(\theta) \) are partitions of the neg-log-likelihood function: If sample is cut into \( N \) partitions, the log-likelihood can be written as the summation of \( N \) parts
  • \( g(\theta) \) can be some penalty terms or prior information
  • Models of this form can be easily parallelized

Parallelized ADMM

  • The previous model can be rewritten as

\[ \begin{aligned} \min & \sum_{i=1}^N f_i(\theta_i)+g(z)\\ \text{s.t.} & \theta_i - z = 0,\ i=1,\ldots,N \end{aligned} \]

  • The iteration steps are

\[ \begin{align} \scriptstyle \theta_i^{k+1} & \scriptstyle :=\underset{\theta_i}{\arg\min}\left(f_i(\theta_i)+\frac{\rho}{2}\Vert \theta_i-z^{k}+u_i^{k}/\rho\Vert_{2}^{2}\right)\\ \scriptstyle z^{k+1} & \scriptstyle :=\underset{z}{\arg\min}\left(g(z)+\frac{N\rho}{2}\Vert z-\bar{\theta}^{k+1}-\bar{u}^{k}/\rho\Vert_{2}^{2}\right)\\ \scriptstyle u_i^{k+1} & \scriptstyle :=u_i^{k}+\rho(\theta_i^{k+1}-z^{k+1}) \end{align} \]

Parallelized ADMM

  • A diagram of P-ADMM

Implementation in Scala/Spark

ADMM-Spark

  • https://github.com/yixuan/ADMM-Spark
  • Currently used to solve logistic lasso regression
  • Can be easily extended to general penalized likelihood model \[ \min_{\theta}\, l(\theta)+\lambda\Vert\theta\Vert_1 \]

Structure

Abstract class

  • class PADMML1
  • For general L1-penalized likelihood model \[ \scriptstyle\min_{\theta}\, \sum_{i=1}^N l(\theta;x_i,y_i)+\lambda\Vert\theta\Vert_1 \]
  • Implementing the following iteration algorithm \[ \begin{align} \scriptstyle \theta_i^{k+1} & \scriptstyle :=\underset{\theta_i}{\arg\min}\left(l(\theta_i;x_i,y_i)+\frac{\rho}{2}\Vert \theta_i-z^{k}+u_i^{k}/\rho\Vert_{2}^{2}\right)\\ \scriptstyle z^{k+1} & \scriptstyle :=S_{\lambda/(\rho N)}(\bar{\theta}^{k+1}+\bar{u}^{k+1}/\rho)\\ \scriptstyle u_i^{k+1} & \scriptstyle :=u_i^{k}+\rho(\theta_i^{k+1}-z^{k+1}) \end{align} \]
  • \( \theta \)-step is undefined (abstract)

Derived class

  • class PLogisticLasso extends PADMML1
  • Solves sparse logistic regression
  • Only needs to implement the function for \( \theta \)-step
  • Calls the solver of logistic ridge regression

Solver class

  • class LogisticRidge
  • class LogisticRidgeNative
  • Solves the problem \[ \scriptstyle\min_{\theta}\, l(\theta)+\lambda\Vert\theta-v\Vert_2^2 \]
  • LogisticRidge: pure Java code
  • LogisticRidgeNative: using C++ code to speed up

Example Usage

import statr.stat598bd.PLogisticLasso

// Create RDD objects x and y representing
// the data matrix and response vector
// ...

val model = new PLogisticLasso(x, y, sc)
model.set_lambda(2.0)
model.set_opts(max_iter = 500,
               eps_abs = 1e-3,
               eps_rel = 1e-3,
               logs = true)
model.run()

val beta = model.coef

Real Data Computing

Tradeshift Data

  • 1.7 million rows (text blocks) / 145 variables

Text Classification

  • Classify each text block to one or more labels (33 in total), such as date, invoice number or monetary amount
  • We consider the 33rd label

Features

  • Numbers: Continuous/discrete numerical values
  • Boolean: The values include YES (true) or NO (false)
  • Encrypted text

Overall Flow

  • Preprocess the data by removing text and missing values, and encoding YES/NO values to 1/0
  • Store the cleaned data to HDFS
  • Read data from Spark and randomly split into training set (99%) and tuning set (1%)
  • Standardize the training set
  • Consider 10 \( \lambda \) values ranging from \( 0.001n \) to \( 0.1n \), equally spaced in the log scale

Overall Flow - Cont.

  • Fit a logistic lasso regression for each \( \lambda \), and calculate the log-likelihood on tuning set
  • Select the “optimal” \( \lambda \)
  • Obtain the final regression coefficient based on the selected \( \lambda \)

Timings

  • With 10 / 20 Executors
  • Cleaning data: 41.3 s / 41.3 s
  • Reading data in Spark: 23.6 s / 17.6 s
  • Data standardization: 7.9 s / 7.7 s
  • Setting up model: 9.4 s / 6.6 s
  • Model fitting:

Timings

  • Coding time: FOREVER
“Multithreading: Theory and Practice”