Singular value decomposition
Sparse matrices
Case study: regression and PCA for sparse data
Singular value decomposition (SVD) can be viewed as an extension of the eigenvalue decomposition of symmetric matrices
SVD applies to general rectangular matrices
Wide applications in dimension reduction, image compression, recommender systems, etc.
Every matrix $A \in \mathbb{R}^{m \times n}$ of rank $r$ can be decomposed as
$$A = U\Sigma V' = \begin{pmatrix} U_1 & U_2 \end{pmatrix}\begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} V_1' \\ V_2' \end{pmatrix},$$
where $U_{m\times m} = (u_1, \ldots, u_m)$ and $V_{n\times n} = (v_1, \ldots, v_n)$ are orthogonal matrices, $U_1 \in \mathbb{R}^{m\times r}$, $V_1 \in \mathbb{R}^{n\times r}$, and $\Sigma_1 = \operatorname{diag}(\sigma_1, \ldots, \sigma_r)$ is a nonnegative diagonal matrix.
$\sigma_1 \ge \cdots \ge \sigma_r > 0$ are called the singular values of $A$. $u_i$, $i = 1, \ldots, m$ and $v_j$, $j = 1, \ldots, n$ are called the left and right singular vectors of $A$, respectively.
From the factorization above, we can also write $A = U_1\Sigma_1V_1'$, where $A$ is $m\times n$, $U_1$ is $m\times r$, $\Sigma_1$ is $r\times r$, and $V_1$ is $n\times r$.
$U_1$ and $V_1$ are column-orthonormal matrices, i.e., $U_1'U_1 = V_1'V_1 = I_r$.
This is typically called the compact SVD.
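A quick numerical check of the compact SVD, using R's built-in svd() (discussed in more detail later), which returns exactly the $U_1$, $\Sigma_1$, and $V_1$ factors for a full-column-rank matrix:

```r
# Verify column orthonormality of U1 and V1 and the reconstruction A = U1 Sigma1 V1'
set.seed(123)
A = matrix(rnorm(8 * 5), 8, 5)
s = svd(A)                                   # compact SVD of A
max(abs(crossprod(s$u) - diag(5)))           # U1'U1 = I
max(abs(crossprod(s$v) - diag(5)))           # V1'V1 = I
max(abs(s$u %*% diag(s$d) %*% t(s$v) - A))   # reconstruction error, ~ machine precision
```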
Suppose we have the (compact) SVD $A = U\Sigma V'$; then
$$A'A = V\Sigma U'U\Sigma V' = V\Sigma^2 V'$$
$$AA' = U\Sigma V'V\Sigma U' = U\Sigma^2 U'$$
The singular values of $A$ are the square roots of the positive eigenvalues of $A'A$ and $AA'$
For generality, we usually also allow singular values to be zero, by extending the $U$, $\Sigma$, and $V$ matrices to the proper dimensions
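A quick numerical check of this relation (the small random matrix here is only for illustration):

```r
# Singular values of A vs. square roots of the eigenvalues of A'A
set.seed(123)
A = matrix(rnorm(60 * 20), 60, 20)
sv = svd(A)$d
ev = eigen(crossprod(A), symmetric = TRUE)$values   # eigenvalues of A'A
max(abs(sv - sqrt(ev)))                             # should be ~ machine precision
```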
If we only extract the largest $k$ singular values and the associated singular vectors of $A$, we call the result $U_{(k)}\Sigma_{(k)}V_{(k)}'$ the partial SVD of $A$
$\Sigma_{(k)} = \operatorname{diag}(\sigma_1, \ldots, \sigma_k)$, $\sigma_1 \ge \cdots \ge \sigma_k$
$U_{(k)} = (u_1, \ldots, u_k)$
$V_{(k)} = (v_1, \ldots, v_k)$
Partial SVD is useful due to an interesting theorem.
Suppose that a matrix $A_{m\times n}$, $m \ge n$, has SVD $A = U\Sigma V'$, where $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_n)$ and $\sigma_1 \ge \cdots \ge \sigma_n \ge 0$. Then the best rank-$k$ approximation to $A$ in the Frobenius norm is given by
$$A_k = \sum_{i=1}^k \sigma_i u_i v_i'.$$
The claim also holds with the operator norm.
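A small numerical illustration of the theorem (the matrix and the choice $k = 5$ are arbitrary): the Frobenius-norm error of the truncated SVD equals $\sqrt{\sigma_{k+1}^2 + \cdots + \sigma_n^2}$, the minimum over all rank-$k$ matrices.

```r
# Best rank-k approximation via the truncated SVD
set.seed(123)
A = matrix(rnorm(100 * 40), 100, 40)
s = svd(A)
k = 5
Ak = s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])   # A_k = sum_{i<=k} sigma_i u_i v_i'
norm(A - Ak, type = "F")      # Frobenius-norm error of the rank-k truncation
sqrt(sum(s$d[-(1:k)]^2))      # theoretical minimum, matches the line above
```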
Similar to eigenvalue computation, there are different use cases of SVD:
Computing all singular values (full SVD)
Computing the largest k singular values (partial SVD)
It is not common to request the smallest singular values.
Similar to full eigenvalue decomposition, computing full SVD also typically has two steps:
Looking for orthogonal matrices $U_1$ and $V_1$ such that $U_1'AV_1$ is bidiagonal:
$$U_1'AV_1 = \begin{bmatrix} B \\ 0 \end{bmatrix}, \qquad B = \begin{bmatrix} \rho_1 & \theta_2 & & & \\ & \rho_2 & \theta_3 & & \\ & & \ddots & \ddots & \\ & & & \rho_{n-1} & \theta_n \\ & & & & \rho_n \end{bmatrix}$$
Applying the QR algorithm to the matrix B to compute all singular values
Any $A_{m\times n}$ with $m \ge n$ can be reduced to $U_1'AV_1 = \begin{bmatrix} B \\ 0 \end{bmatrix}$ by orthogonal matrices $U_1 \in \mathbb{R}^{m\times m}$ and $V_1 \in \mathbb{R}^{n\times n}$. $U_1$ and $V_1$ are products of simple orthogonal matrices, e.g., Householder reflections as in the Golub–Kahan bidiagonalization [3].
The singular values of B are the same as those of A.
Since the singular values of $B$ are the square roots of the eigenvalues of $B'B$, the QR algorithm can proceed as follows:
Compute the QR decomposition $B^{(k)\prime}B^{(k)} = Q_kR_k$
Update $B^{(k)}$ to $B^{(k+1)}$ such that $B^{(k+1)\prime}B^{(k+1)} = R_kQ_k = Q_k'B^{(k)\prime}B^{(k)}Q_k$
$B^{(k)}$ is bidiagonal for all $k$
The QR decomposition and the update of $B^{(k)}$ exploit the special structure of $B^{(k)}$, so $B^{(k)\prime}B^{(k)}$ is never explicitly formed
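A naive illustration of these two steps, with one simplification: practical implementations work on the bidiagonal $B$ directly, whereas the sketch below forms $T = B'B$ explicitly just to keep the code short (the matrix size and the iteration count are arbitrary):

```r
# Unshifted QR iteration on T = B'B for a small random bidiagonal matrix
set.seed(123)
n = 5
B = diag(rnorm(n))
B[cbind(1:(n - 1), 2:n)] = rnorm(n - 1)   # upper bidiagonal matrix
Tk = crossprod(B)                         # T = B'B (formed only for illustration)
for (k in 1:1000) {
    qr_obj = qr(Tk)
    Tk = qr.R(qr_obj) %*% qr.Q(qr_obj)    # T_{k+1} = R_k Q_k
}
sort(sqrt(abs(diag(Tk))), decreasing = TRUE)   # approximate singular values of B
svd(B)$d                                       # reference values
```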
In practice, there are many improvements made to this basic method
A good summary of those advanced algorithms can be found at https://www.cs.utexas.edu/users/inderjit/public_papers/HLA_SVD.pdf
Partial SVD seeks the largest $k$ singular values of a matrix $A_{m\times n}$
Can be reduced to computing the largest $k$ eigenvalues of $A'A$ (if $m \ge n$) or $AA'$ (if $m < n$)
Using the power method or other iterative methods, and assuming $m \ge n$, we need to realize the operation $v \rightarrow A'Av$
Be cautious about the computing order! Compute $x \leftarrow Av$ first and then $y \leftarrow A'x$, as in the sketch below
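A minimal power-iteration sketch illustrating this computing order (the matrix size and the fixed number of iterations are arbitrary choices for the example):

```r
# Power method on A'A without ever forming A'A
set.seed(123)
A = matrix(rnorm(200 * 50), 200, 50)
v = rnorm(50)
v = v / sqrt(sum(v^2))
for (k in 1:500) {
    x = A %*% v             # x <- A v  (first!)
    y = crossprod(A, x)     # y <- A' x = A'A v
    v = as.numeric(y) / sqrt(sum(y^2))
}
sqrt(sum((A %*% v)^2))   # estimated largest singular value
svd(A)$d[1]              # reference value
```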
Another algorithm to compute partial SVD is the augmented implicitly restarted Lanczos bidiagonalization method [4]
It works on the original matrix A instead of A′A
Potentially more numerically stable than computing the eigenvalues of $A'A$
Use svd() for the full (but compact) SVD

```r
set.seed(123)
n = 1000
p = 500
x = matrix(rnorm(n * p), n, p)
str(svd(x))
```

```
## List of 3
##  $ d: num [1:500] 53.6 53.3 53.3 53 52.6 ...
##  $ u: num [1:1000, 1:500] 0.00265 -0.03029 0.00451 -0.00763 -0.0034 ...
##  $ v: num [1:500, 1:500] 0.01699 0.00386 0.00971 -0.01931 0.01455 ...
```
```r
str(svd(x, nu = 10, nv = 10))
```

```
## List of 3
##  $ d: num [1:500] 53.6 53.3 53.3 53 52.6 ...
##  $ u: num [1:1000, 1:10] 0.00265 -0.03029 0.00451 -0.00763 -0.0034 ...
##  $ v: num [1:500, 1:10] 0.01699 0.00386 0.00971 -0.01931 0.01455 ...
```
Set nu = 0 and nv = 0 if you do not need singular vectors

```r
library(dplyr)
bench::mark(
    svd(x),
    svd(x, nu = 100, nv = 100),
    svd(x, nu = 1, nv = 1),
    svd(x, nu = 0, nv = 0),
    min_iterations = 3, max_iterations = 10, check = FALSE
) %>% select(expression, min, median)
```

```
## # A tibble: 4 × 3
##   expression                      min   median
##   <bch:expr>                 <bch:tm> <bch:tm>
## 1 svd(x)                        785ms    795ms
## 2 svd(x, nu = 100, nv = 100)    784ms    786ms
## 3 svd(x, nu = 1, nv = 1)        786ms    788ms
## 4 svd(x, nu = 0, nv = 0)        272ms    275ms
```
Several functions compute the partial SVD: irlba::irlba, RSpectra::svds, svd::propack.svd, svd::trlan.svd, and svd::ztrlan.svd
```r
res1 = irlba::irlba(x, 10)
res1$d
```

```
##  [1] 53.56025 53.30589 53.25349 53.03145 52.56782 52.30170 52.13693 51.85140
##  [9] 51.64052 51.42326
```

```r
res2 = RSpectra::svds(x, k = 10)
res2$d
```

```
##  [1] 53.56025 53.30589 53.25349 53.03145 52.56782 52.30170 52.13693 51.85140
##  [9] 51.64052 51.42326
```

```r
res3 = svd::propack.svd(x, neig = 10)
res3$d
```

```
##  [1] 53.56025 53.30589 53.25349 53.03145 52.56782 52.30170 52.13693 51.85140
##  [9] 51.64052 51.42326
```

```r
res4 = svd::trlan.svd(x, neig = 10)
res4$d
```

```
##  [1] 53.56025 53.30589 53.25349 53.03145 52.56782 52.30170 52.13693 51.85140
##  [9] 51.64052 51.42326
```

```r
res5 = svd::ztrlan.svd(x, neig = 10)
res5$d
```

```
##  [1] 53.56025 53.30589 53.25349 53.03145 52.56782 52.30170 52.13693 51.85140
##  [9] 51.64052 51.42326
```
irlba::irlba, RSpectra::svds, and svd::propack.svd are roughly at the same performance level

```r
bench::mark(
    irlba::irlba(x, 10, tol = 1e-8),
    RSpectra::svds(x, k = 10, opts = list(tol = 1e-8)),
    svd::propack.svd(x, neig = 10, opts = list(tol = 1e-8)),
    svd::trlan.svd(x, neig = 10, opts = list(tol = 1e-8)),
    svd::ztrlan.svd(x, neig = 10, opts = list(tol = 1e-8)),
    min_iterations = 10, max_iterations = 10, check = FALSE
) %>% select(expression, min, median)
```

```
## # A tibble: 5 × 3
##   expression                                                     min   median
##   <bch:expr>                                                <bch:tm> <bch:tm>
## 1 irlba::irlba(x, 10, tol = 1e-08)                            97.7ms  109.8ms
## 2 RSpectra::svds(x, k = 10, opts = list(tol = 1e-08))         88.4ms   89.8ms
## 3 svd::propack.svd(x, neig = 10, opts = list(tol = 1e-08))    99.2ms  102.3ms
## 4 svd::trlan.svd(x, neig = 10, opts = list(tol = 1e-08))     241.6ms  248.7ms
## 5 svd::ztrlan.svd(x, neig = 10, opts = list(tol = 1e-08))    400.8ms  406.1ms
```
A sparse matrix is a matrix that has very few nonzero elements.
We are mostly interested in the following problems:
How to store sparse data/matrices
How to implement basic matrix operations on sparse matrices
How to do statistical computing on sparse matrices
Two major storage formats:
Coordinate (COO) format
Compressed sparse column (CSC) format
The COO format is the simplest storage scheme
Store the row indices, column indices, and nonzero elements
$$A = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 3 & 0 & 0 & 5 & 0 \\ 0 & 0 & 7 & 8 & 9 \\ 0 & 0 & 2 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$
$$i = [1\ 2\ 3\ 4\ 2\ 3\ 4\ 3] \qquad j = [1\ 1\ 3\ 3\ 4\ 4\ 4\ 5] \qquad x = [1\ 3\ 7\ 2\ 5\ 8\ 1\ 9]$$
For an m×n matrix with nnz nonzero elements
COO format needs to store 3⋅nnz numbers
However, the j vector contains redundant information
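As a small illustration before moving on to CSC, the Matrix package (used later for the CSC class dgCMatrix) also provides a triplet class, dgTMatrix, which stores exactly the three COO vectors:

```r
# Store the 5 x 5 example matrix above in COO (triplet) form
library(Matrix)
i = c(1, 2, 3, 4, 2, 3, 4, 3)
j = c(1, 1, 3, 3, 4, 4, 4, 5)
x = c(1, 3, 7, 2, 5, 8, 1, 9)
Acoo = spMatrix(nrow = 5, ncol = 5, i = i, j = j, x = x)   # a dgTMatrix object
str(Acoo)   # the i, j, x slots hold the COO vectors (indices are 0-based internally)
```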
The CSC format is a more economical scheme for representing sparse matrices
The $x$ vector stores the elements of $A$ column by column
Use a "pointer vector" $p$, where $p_{j+1} - p_j$ is the number of nonzero elements in column $j$
$$A = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 3 & 0 & 0 & 5 & 0 \\ 0 & 0 & 7 & 8 & 9 \\ 0 & 0 & 2 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$
$$i = [1\ 2\ 3\ 4\ 2\ 3\ 4\ 3] \qquad j = [1\ 1\ 3\ 3\ 4\ 4\ 4\ 5] \qquad p = [0\ 2\ 2\ 4\ 7\ 8] \qquad x = [1\ 3\ 7\ 2\ 5\ 8\ 1\ 9]$$
The CSC format only needs to store the $i$, $p$, and $x$ vectors
$2 \cdot nnz + n + 1$ numbers in total
Question: If we also know that the sparse matrix is binary, i.e., every element is either 0 or 1, then what is a good scheme for storage?
Elementwise operations are relatively simple
One of the most important operations for linear systems and eigenvalue computation is the matrix-vector multiplication
v→Av and/or v→A′v
The key point is to extract each column of A
Each column can be viewed as a sparse vector
$$A = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 3 & 0 & 0 & 5 & 0 \\ 0 & 0 & 7 & 8 & 9 \\ 0 & 0 & 2 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & a_3 & a_4 & a_5 \end{bmatrix}$$
$$A'v = \begin{bmatrix} a_1'v & a_2'v & a_3'v & a_4'v & a_5'v \end{bmatrix}'$$
$$A = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 3 & 0 & 0 & 5 & 0 \\ 0 & 0 & 7 & 8 & 9 \\ 0 & 0 & 2 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$
$$i = [1\ 2\ 3\ 4\ 2\ 3\ 4\ 3] \qquad p = [0\ 2\ 2\ 4\ 7\ 8] \qquad x = [1\ 3\ 7\ 2\ 5\ 8\ 1\ 9]$$
For example, to get the third column, $a_3$:
$p_4 - p_3 = 2$, meaning $a_3$ has two nonzero elements
$p_3 + 1 = 3$: $x_3 = 7$, $i_3 = 3$
$p_3 + 2 = 4$: $x_4 = 2$, $i_4 = 4$
$a_3$ can be expressed as $\{3 \rightarrow 7, 4 \rightarrow 2\}$
Similarly, we can get
$$a_1 = \{1 \rightarrow 1, 2 \rightarrow 3\},\quad a_2 = \{\},\quad a_3 = \{3 \rightarrow 7, 4 \rightarrow 2\},\quad a_4 = \{2 \rightarrow 5, 3 \rightarrow 8, 4 \rightarrow 1\},\quad a_5 = \{3 \rightarrow 9\}$$
Then the inner products only involve nonzero elements:
$$a_1'v = 1\cdot v_1 + 3\cdot v_2,\quad a_2'v = 0,\quad a_3'v = 7\cdot v_3 + 2\cdot v_4,\quad \cdots$$
$$A = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 3 & 0 & 0 & 5 & 0 \\ 0 & 0 & 7 & 8 & 9 \\ 0 & 0 & 2 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & a_3 & a_4 & a_5 \end{bmatrix}$$
$$Av = v_1a_1 + \cdots + v_5a_5$$
Therefore, computing $Av$ reduces to:
Initialize $r \leftarrow 0_m$
For $j = 1, \ldots, n$ do $r \leftarrow r + v_ja_j$
For example, $a_3 = \{3 \rightarrow 7, 4 \rightarrow 2\}$
Then $r \leftarrow r + v_3a_3$ expands to
$$r_3 \leftarrow r_3 + 7\cdot v_3, \qquad r_4 \leftarrow r_4 + 2\cdot v_3$$
In general, for a matrix $A_{m\times n}$ with $nnz$ nonzero elements
The complexities of $v \rightarrow Av$ and $v \rightarrow A'v$ are both $O(nnz)$
This makes many iterative methods very efficient on sparse matrices (see the sketch below)
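The sketch below (the helper name csc_matvec is made up for this example) computes $Av$ directly from the CSC vectors $(i, p, x)$ by accumulating $v_ja_j$ column by column; in practice the Matrix package, introduced next, performs this operation far more efficiently.

```r
# v -> Av using the CSC representation; purely illustrative, not production code
csc_matvec = function(i, p, x, v, m)
{
    r = numeric(m)                   # r <- 0_m
    n = length(p) - 1
    for (j in seq_len(n))
    {
        nnz_j = p[j + 1] - p[j]      # number of nonzero elements in column j
        if (nnz_j > 0)
        {
            ind = (p[j] + 1):p[j + 1]
            r[i[ind]] = r[i[ind]] + v[j] * x[ind]   # r <- r + v_j * a_j
        }
    }
    r
}
# A'v is analogous, using inner products a_j'v = sum(x[ind] * v[i[ind]])

# The 5 x 5 example above (row indices written 1-based here)
i = c(1, 2, 3, 4, 2, 3, 4, 3)
p = c(0, 2, 2, 4, 7, 8)
x = c(1, 3, 7, 2, 5, 8, 1, 9)
csc_matvec(i, p, x, v = c(1, 2, 3, 4, 5), m = 5)   # equals A %*% c(1, 2, 3, 4, 5)
```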
The Matrix package constructs a sparse matrix from the (i, j, x) triplet (COO format) and stores it in the CSC format (class dgCMatrix)

```r
library(Matrix)
i = c(1, 2, 3, 4, 2, 3, 4, 3)
j = c(1, 1, 3, 3, 4, 4, 4, 5)
x = c(1, 3, 7, 2, 5, 8, 1, 9)
xsp = sparseMatrix(i = i, j = j, x = x, dims = c(5, 5))
xsp
```

```
## 5 x 5 sparse Matrix of class "dgCMatrix"
##
## [1,] 1 . . . .
## [2,] 3 . . 5 .
## [3,] . . 7 8 9
## [4,] . . 2 1 .
## [5,] . . . . .
```
```r
str(xsp)
```

```
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:8] 0 1 2 3 1 2 3 2
##   ..@ p       : int [1:6] 0 2 2 4 7 8
##   ..@ Dim     : int [1:2] 5 5
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ x       : num [1:8] 1 3 7 2 5 8 1 9
##   ..@ factors : list()
```
```r
set.seed(123)
n = 1000
nnz = floor(0.01 * n^2)
xdata = numeric(n^2)
xdata[sample(n^2, nnz)] = rnorm(nnz)
x = matrix(xdata, n, n)
class(x)
```

```
## [1] "matrix" "array"
```

```r
xsp = as(x, "sparseMatrix")
class(xsp)
```

```
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
```

```r
bench::mark(
    x %*% x,
    crossprod(x),
    xsp %*% xsp,
    crossprod(xsp),
    min_iterations = 10, max_iterations = 10, check = FALSE
) %>% select(expression, min, median)
```

```
## # A tibble: 4 × 3
##   expression          min   median
##   <bch:expr>     <bch:tm> <bch:tm>
## 1 x %*% x        488.19ms 493.53ms
## 2 crossprod(x)    482.8ms 493.61ms
## 3 xsp %*% xsp       1.3ms   1.53ms
## 4 crossprod(xsp)   1.48ms   1.58ms
```
Direct methods for sparse matrices are typically difficult problems
Preservation of sparsity is a big challenge
If $A$ is sparse and $A = LU$, the factors $L$ and $U$ are not necessarily sparse
Therefore, proper ordering of the rows/columns of $A$ is critical
With suitable permutation matrices $P$ and $Q$, $PAQ = \tilde{L}\tilde{U}$, where $\tilde{L}$ and $\tilde{U}$ may be much sparser than $L$ and $U$
There are indeed sparse decomposition methods, e.g.,
Sparse LU decomposition
Sparse Cholesky decomposition
They are typically very sophisticated
Software support such as SuiteSparse
We are mostly interested in iterative methods for sparse matrices
Small memory cost
Computationally efficient
Easy to implement
Iterative methods that rely on v→Av, v→A′v, or some other minor operations are readily available for sparse matrices
Conjugate gradient method
Iterative eigenvalue algorithms such as power method
Partial SVD
Regression
Ridge regression
PCA
Lasso (preview)
Suppose the data matrix X is sparse
The target is $\hat{\beta} = (X'X)^{-1}X'Y$
However, X′X may no longer be sparse
Also, X typically does not have a "specialized" QR decomposition
Therefore, there is no obvious benefit of the sparsity (except for computing X′X)
In contrast, CG can effectively utilize the sparsity
CG solves Ax=b, where A=X′X and b=X′Y
$A$ is never explicitly formed
CG requires the operation v→Av=X′Xv
Compute two sparse matrix-vector multiplications
u←Xv
w←X′u
```r
library(Matrix)
set.seed(123)
n = 10000
p = 500
xsp = rsparsematrix(n, p, density = 0.001)
x = as.matrix(xsp)
y = rnorm(n)
bhat = solve(crossprod(x), crossprod(x, y))
bhat[1:5]
```

```
## [1] -0.003835525 -0.211645950  0.102156056  0.294562285  0.031367565
```
```r
reg_cg = function(Xsp, y, x0 = rep(0, ncol(Xsp)), eps = 1e-6)
{
    # Right-hand side b = X'y and initial residual r = b - X'X x0
    b = as.numeric(crossprod(Xsp, y))
    x = x0
    p = r = b - as.numeric(crossprod(Xsp, Xsp %*% x))
    r2 = sum(r^2)
    errs = c()
    for(i in seq_along(b))
    {
        # A p = X'X p, computed as two sparse matrix-vector multiplications
        Ap = as.numeric(crossprod(Xsp, Xsp %*% p))
        alpha = r2 / sum(p * Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        r2_new = sum(r^2)
        err = sqrt(r2_new)
        errs = c(errs, err)
        if(err < eps)
            break
        beta = r2_new / r2
        p = r + beta * p
        r2 = r2_new
    }
    list(beta_hat = x, errs = errs, niter = i)
}
```
```r
cg = reg_cg(xsp, y, eps = 1e-6)
bhat_cg = cg$beta_hat
cg$niter
```

```
## [1] 56
```

```r
bhat_cg[1:5]
```

```
## [1] -0.003835524 -0.211645950  0.102156055  0.294562283  0.031367563
```

```r
max(abs(bhat_cg - bhat))
```

```
## [1] 1.548608e-07
```

```r
bench::mark(
    solve(crossprod(x), crossprod(x, y)),
    reg_cg(xsp, y),
    min_iterations = 3, max_iterations = 10, check = FALSE
) %>% select(expression, min, median)
```

```
## # A tibble: 2 × 3
##   expression                                min   median
##   <bch:expr>                           <bch:tm> <bch:tm>
## 1 solve(crossprod(x), crossprod(x, y))    1.37s    1.37s
## 2 reg_cg(xsp, y)                         3.83ms   5.06ms
```
Ridge regression is similar, but now $A = X'X + \lambda I$ and $b = X'Y$
Again, $A$ should not be explicitly formed
Now we need the operation $v \rightarrow Av = (X'X + \lambda I)v$
Compute two sparse matrix-vector multiplications and one vector addition (see the sketch after this list):
$u \leftarrow Xv$
$w \leftarrow X'u$
$z \leftarrow w + \lambda v$
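Below is a minimal sketch of this operator for a sparse $X$ (the name ridge_op and the small test data are illustrative); the full CG solver built on it is left for the exercise that follows.

```r
# v -> (X'X + lambda * I) v without forming X'X
ridge_op = function(Xsp, v, lambda)
{
    u = Xsp %*% v                       # u <- X v
    w = as.numeric(crossprod(Xsp, u))   # w <- X' u
    w + lambda * v                      # z <- w + lambda * v
}

# Quick check against the dense computation on a small example
library(Matrix)
set.seed(123)
Xsp = rsparsematrix(100, 10, density = 0.05)
v = rnorm(10)
max(abs(ridge_op(Xsp, v, lambda = 0.1) -
        as.numeric((crossprod(as.matrix(Xsp)) + 0.1 * diag(10)) %*% v)))
```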
Implement ridge regression for a sparse $X$ with $n = 10000$ and $p = 500$
What if p≫n? For example, X is a sparse matrix with n=500 and p=10000
We can find that the majority of the CG code is unchanged across different problems. Design an implementation of CG that is as general as possible. Hint: pass a function Ax as the first argument of CG, where Ax has the signature

```r
Ax = function(x, args) {...}
```

Ax should implement the operation $v \rightarrow Av$, and args contains information about $A$.
Another common task for large data is computing PCA
Equivalent to computing the largest k eigenvalues of the sample covariance matrix
An obvious way is to first compute
$$S = \frac{1}{n-1}\sum_{i=1}^n (x_{i\cdot} - \bar{x})(x_{i\cdot} - \bar{x})',$$
where $x_{i\cdot} \in \mathbb{R}^p$ is the $i$-th observation, and $\bar{x} \in \mathbb{R}^p$ is the sample mean vector
Then compute the eigenvalues of $S$
However, this does not fully utilize the sparsity
In fact, S is generally a dense matrix
A better scheme is to observe that
$$S = \frac{1}{n-1}(X - 1_n\bar{x}')'(X - 1_n\bar{x}'),$$
where $X$ is the data matrix, and $1_n$ is a vector of ones
Ignoring the $(n-1)^{-1}$ factor, we only need to compute the eigenvalues of $A'A$, where $A = X - 1_n\bar{x}'$
Also equivalent to the partial SVD of $A$
Then we need to implement the operations v→Av and v→A′v
As a first step, we need to compute ¯x. This is relatively easy since we have shown how to extract columns of X
Then to do $v \rightarrow Av = (X - 1_n\bar{x}')v$:
$u \leftarrow Xv$
$s \leftarrow \bar{x}'v$
$w \leftarrow u - s \cdot 1_n$
Similarly, to do $v \rightarrow A'v = (X - 1_n\bar{x}')'v$:
$u \leftarrow X'v$
$s \leftarrow 1_n'v$
$w \leftarrow u - s\bar{x}$
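A minimal sketch of these two operations for a sparse $X$ stored as a dgCMatrix (the function names Av and Atv and the simulated data are illustrative); the centered matrix $A$ is never formed, and the results are checked against a dense computation.

```r
library(Matrix)
set.seed(123)
n = 2000
p = 100
Xsp = rsparsematrix(n, p, density = 0.01)
xbar = colMeans(Xsp)                     # sample mean vector

Av  = function(v) as.numeric(Xsp %*% v) - sum(xbar * v)           # u - s * 1_n
Atv = function(v) as.numeric(crossprod(Xsp, v)) - sum(v) * xbar   # u - s * xbar

# Verify against the explicitly centered dense matrix
Adense = as.matrix(Xsp) - matrix(xbar, n, p, byrow = TRUE)
v = rnorm(p)
u = rnorm(n)
max(abs(Av(v) - as.numeric(Adense %*% v)))
max(abs(Atv(u) - as.numeric(crossprod(Adense, u))))

# Squared singular values of A, divided by (n - 1), match the top eigenvalues
# of the sample covariance matrix
svd(Adense)$d[1:5]^2 / (n - 1)
eigen(cov(as.matrix(Xsp)), symmetric = TRUE)$values[1:5]
```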
See https://statr.me/2019/11/rspectra-center-scale/ for more details, especially the center and scale parameters of the RSpectra::svds function
[1] Grégoire Allaire and Sidi Mahmoud Kaber (2008). Numerical linear algebra. Springer.
[2] Åke Björck (2015). Numerical methods in matrix computations. Springer.
[3] Gene H. Golub and William Kahan (1965). Calculating the singular values and pseudoinverse of a matrix. Journal of SIAM: Series B, Numerical Analysis, 205–224.
[4] James Baglama and Lothar Reichel (2005). Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM Journal on Scientific Computing, 27(1), 19-42.
[5] Iain S. Duff, Albert M. Erisman, and John K. Reid (2017). Direct methods for sparse matrices. Oxford University Press.
[6] Yousef Saad (2003). Iterative methods for sparse linear systems. Society for Industrial and Applied Mathematics.