Conjugate gradient method
Eigenvalue computation
All the methods introduced so far can be categorized as direct methods to solve linear systems
Meaning that the exact solution to Ax=b can be computed within a finite number of operations (in exact arithmetic)
Cons: too expensive to handle very large linear systems
Iterative methods instead compute a sequence of approximate solutions x^{(k)} that converges to the exact solution
Stops when the precision is sufficient
Do not request more precision than is needed
In each iteration, the computation is typically cheap, e.g., matrix-vector multiplication v→Av
Especially efficient for sparse matrices
Conjugate gradient method (CG) is a very special linear system solver
It is a direct method used as an iterative one
Aims to solve the linear system Ax=b when A is positive definite (p.d.)
Only uses the matrix-vector multiplication v→Av
Obtains the exact solution after n steps
Converges fast under some conditions
In exact arithmetic, the algorithm converges to the solution of the linear system Ax=b in at most n iterations.
Let A be a p.d. matrix and denote by x^* the solution to Ax=b. Let x^{(k)} be the sequence of approximate solutions produced by the conjugate gradient method. Then
\Vert x^{(k)}-x^*\Vert_2\le 2\sqrt{\kappa}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k}\Vert x^{(0)}-x^*\Vert_2, where \kappa is the condition number of A (defined later).
For any matrix A, the condition number \kappa(A) is defined as \kappa(A)=\mu_{\max}/\mu_{\min}, where \mu_{\max} and \mu_{\min} are the largest and smallest singular values of A, respectively
For a p.d. matrix A, \kappa(A)=\lambda_{\max}(A)/\lambda_{\min}(A), where \lambda_{\max} and \lambda_{\min} are the largest and smallest eigenvalues, respectively
\kappa(A)\ge 1
For any orthogonal matrix Q, \kappa(Q)=1 and \kappa(QA)=\kappa(AQ)=\kappa(A)
If a p.d. matrix A has almost equal eigenvalues, then \kappa(A)\approx 1 and \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\approx 0. CG will have extremely fast convergence
If A almost loses positive definiteness, i.e., \lambda_{\min}(A)\approx 0, then \kappa(A)\gg 1, and CG makes almost no progress
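These properties are easy to check numerically. Below is a small sketch in base R; the matrices Q and B are our own illustrative examples, and kappa(, exact = TRUE) returns the exact 2-norm condition number (ratio of extreme singular values).

# Quick numerical check of the condition number properties above
set.seed(42)
Q = qr.Q(qr(matrix(rnorm(25), 5)))   # a random orthogonal matrix
B = crossprod(matrix(rnorm(25), 5))  # a random p.d. matrix
c(kappa(Q, exact = TRUE),            # approximately 1 for an orthogonal matrix
  kappa(B, exact = TRUE),
  kappa(Q %*% B, exact = TRUE))      # multiplying by Q does not change kappa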
cg = function(A, b, x0 = rep(0, length(b)), eps = 1e-6) {
  m = length(b)
  x = x0
  # Initial residual and search direction
  p = r = b - A %*% x
  r2 = sum(r^2)
  errs = c()
  for(i in 1:m) {
    Ap = A %*% p
    # Step size along the current search direction
    alpha = r2 / sum(p * Ap)
    x = x + alpha * p
    r = r - alpha * Ap
    r2_new = sum(r^2)
    err = sqrt(r2_new)
    errs = c(errs, err)
    if(err < eps) break
    # New search direction, A-conjugate to the previous ones
    beta = r2_new / r2
    p = r + beta * p
    r2 = r2_new
  }
  list(x = x, errs = errs, niter = i)
}
We test on a simulated matrix
However, the algorithm does not seem to converge, and the error is quite large
set.seed(123)
n = 100
M = matrix(0.1 * rnorm(n^2), n)
A = crossprod(M)
b = rnorm(n)
sol = cg(A, b, eps = 1e-12)
sol$niter
## [1] 100
max(abs(A %*% sol$x - b))
## [1] 0.5763024
kappa(A)
## [1] 1164091
kappa(A1 <- A + diag(rep(0.1, n)))
## [1] 134.8415
kappa(A2 <- A + diag(rep(1, n)))
## [1] 15.01074
sol1 = cg(A1, b, eps = 1e-12)
sol1$niter
## [1] 78
max(abs(A1 %*% sol1$x - b))
## [1] 1.588243e-13
sol2 = cg(A2, b, eps = 1e-12)
sol2$niter
## [1] 29
max(abs(A2 %*% sol2$x - b))
## [1] 1.858513e-13
CG is most useful when:
A is p.d. and well-conditioned
The matrix-vector operation v\rightarrow Av is cheap
One example is a sparse matrix, or a product of sparse matrices (see the quick illustration below)
We will elaborate on this point when we study sparse matrices
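As a quick preview, a sketch assuming the Matrix package: the cost of v\rightarrow Av for a sparse matrix scales with the number of nonzeros rather than with n^2.

# Sparse vs dense matrix-vector multiplication (illustrative timing only)
library(Matrix)
set.seed(123)
n = 3000
As = rsparsematrix(n, n, density = 0.001, symmetric = TRUE)  # sparse symmetric
Ad = as.matrix(As)                                           # dense copy
v = rnorm(n)
system.time(for (i in 1:20) As %*% v)  # fast: touches only nonzero entries
system.time(for (i in 1:20) Ad %*% v)  # slow: O(n^2) work per product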
Eigenvalues are important characteristics of matrices
Wide applications in statistics, e.g., principal component analysis (PCA), spectral clustering, matrix norm, random matrix theory, etc.
Eigenvalue computation is one of the core topics of numerical linear algebra
The definition of eigenvalue is actually quite simple
For a square matrix A, we call a number \lambda satisfying Ax=\lambda x for some nonzero vector x an eigenvalue of A
The corresponding x vector is an eigenvector
Here we focus on real symmetric matrices
A matrix A_{n\times n} is real symmetric if and only if there exists a real orthogonal matrix \Gamma and real eigenvalues \lambda_1,\ldots,\lambda_n such that A=\Gamma D\Gamma', where D=\mathrm{diag}(\lambda_1,\ldots,\lambda_n).
This is typically called the eigenvalue decomposition, or spectral decomposition, of A.
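A quick numerical check of the decomposition, using the base R eigen() function discussed in more detail later (the variable names below are our own):

# Verify A = Gamma D Gamma' for a small random symmetric matrix
set.seed(123)
M = matrix(rnorm(16), 4)
A = M + t(M)
e = eigen(A, symmetric = TRUE)
Gamma = e$vectors                       # orthogonal: Gamma' Gamma = I
D = diag(e$values)
max(abs(A - Gamma %*% D %*% t(Gamma)))  # essentially zero (rounding error)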
Suppose that A is a real symmetric matrix
A is p.d. if all its eigenvalues are strictly positive
A^k=\Gamma D^k\Gamma', k=\ldots,-2,-1,0,1,2,\ldots
\det(A)=\lambda_1\cdots\lambda_n
\mathrm{tr}(A)=\lambda_1+\cdots+\lambda_n
Min-Max principle:
\lambda_{\min}(A)=\underset{x\neq 0}{\min}\frac{x'Ax}{x'x}=\underset{\Vert x\Vert=1}{\min}\ x'Ax
\lambda_{\max}(A)=\underset{x\neq 0}{\max}\frac{x'Ax}{x'x}=\underset{\Vert x\Vert=1}{\max}\ x'Ax
This gives PCA a nice interpretation: the first principal component direction maximizes the explained variance (see the short R check below)
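To illustrate, a small sketch (variable names are our own): the leading eigenvector of the sample covariance matrix attains the maximal projection variance, and that variance equals \lambda_{\max}.

# The variance of the projection X v is v' S v, maximized over unit vectors
# by the leading eigenvector of S, with maximum value lambda_max(S)
set.seed(123)
X = matrix(rnorm(200 * 5), 200, 5) %*% matrix(rnorm(25), 5, 5)
S = cov(X)
e = eigen(S, symmetric = TRUE)
v1 = e$vectors[, 1]                        # first principal component direction
c(e$values[1], var(as.numeric(X %*% v1)))  # the two numbers agree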
There are different cases of computing eigenvalues
Computing all eigenvalues (eigenvalue decomposition)
Computing the largest eigenvalue
Computing the eigenvalue that is closest to a given number
...
Eigenvalue computation is a large field in numerical linear algebra
The actual algorithms used in practice are usually quite involved
For the purpose of statistical computing, we mainly introduce the basic ideas behind these algorithms and some recommended implementations
In what follows, we focus on finding eigenvalues of real symmetric matrices
The most commonly-used method to compute all eigenvalues consists of two steps: first reduce A to a tridiagonal matrix T (Householder tridiagonalization), and then iteratively diagonalize T (the QR algorithm)
Q'AQ=T=\begin{bmatrix}\alpha_{1} & \beta_{2} & & & \\ \beta_{2} & \alpha_{2} & \beta_{3} & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_{n-1} & \alpha_{n-1} & \beta_{n}\\ & & & \beta_{n} & \alpha_{n}\end{bmatrix}
The Householder algorithm finds a sequence of (simple) orthogonal matrices Q_k such that T=(Q_1\cdots Q_{n-2})'A(Q_1\cdots Q_{n-2}), where T is tridiagonal
Q=Q_1\cdots Q_{n-2} is also orthogonal (why?)
A and T have the same eigenvalues (why?)
The motivation is that matrix operations on the tridiagonal T are much cheaper (a rough R sketch of this reduction is given below)
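Here is that rough sketch of Householder tridiagonalization, written plainly rather than in the efficient in-place form used in practice; the function name is our own.

# Reduce a symmetric matrix to tridiagonal form with Householder reflections
tridiagonalize = function(A) {
  n = nrow(A)
  T_mat = A
  for (k in 1:(n - 2)) {
    x = T_mat[(k + 1):n, k]
    if (all(x[-1] == 0)) next            # column already in tridiagonal form
    # Householder vector v such that (I - 2vv') x is a multiple of e1
    s = if (x[1] >= 0) 1 else -1
    v = x
    v[1] = v[1] + s * sqrt(sum(x^2))
    v = v / sqrt(sum(v^2))
    H = diag(n - k) - 2 * tcrossprod(v)
    Qk = diag(n)
    Qk[(k + 1):n, (k + 1):n] = H         # the "simple" orthogonal matrix Q_k
    T_mat = Qk %*% T_mat %*% Qk          # similarity transform keeps eigenvalues
  }
  T_mat
}

set.seed(123)
M = matrix(rnorm(36), 6)
A = M + t(M)
round(tridiagonalize(A), 6)              # tridiagonal up to rounding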
The QR algorithm iteratively reduces the tridiagonal T into a diagonal matrix by orthogonal transformations:
Initialize T^{(0)}=T, and begin the loop k=0,1,\ldots
Compute the QR decomposition of the matrix T^{(k)}: T^{(k)}=Q_k R_k
Update T^{(k+1)}=R_k Q_k=Q_k'T^{(k)}Q_k
Assume that |\lambda_1|>\cdots>|\lambda_n|>0, and then T^{(k)} converges to a diagonal matrix, whose diagonal elements are eigenvalues of T and A.
The Householder algorithm costs O(n^3)
The QR decomposition of tridiagonal matrices has a specialized algorithm, which costs O(n^2) instead of O(n^3) (the latter for general dense matrices)
It can be proved that all T^{(k)} matrices are tridiagonal
The number of iterations required in the QR algorithm depends on the distribution of the eigenvalues (a naive R sketch of the iteration follows)
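Below is that naive sketch of the unshifted QR iteration, without the tridiagonal reduction, shifts, or deflation used in real implementations; the function name is our own.

# Unshifted QR iteration: T^(k+1) = R_k Q_k converges to a diagonal matrix
qr_iteration = function(A, maxit = 500, tol = 1e-10) {
  Tk = A
  for (k in 1:maxit) {
    dec = qr(Tk)
    Qk = qr.Q(dec)
    Rk = qr.R(dec)
    Tk = Rk %*% Qk                              # = Qk' Tk Qk, same eigenvalues
    if (max(abs(Tk[lower.tri(Tk)])) < tol) break
  }
  sort(diag(Tk), decreasing = TRUE)
}

set.seed(123)
M = matrix(rnorm(25), 5)
A = M + t(M)
cbind(qr_iteration(A), eigen(A, symmetric = TRUE)$values)  # columns agree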
In many applications, we only need the largest eigenvalue (in absolute value)
One widely-used and possibly the simplest method for this task is the power method (a short R sketch follows this list)
It can also be extended to computing the largest k eigenvalues
If \lambda_1,\ldots,\lambda_n are eigenvalues of A with |\lambda_1|>\cdots>|\lambda_n|, and \gamma_1,\ldots,\gamma_n are the associated eigenvectors, then any vector x=\sum_{i=1}^n\beta_i\gamma_i with \beta_1\neq 0 satisfies A^k x=\sum_{i=1}^n\beta_i\lambda_i^k\gamma_i, which is dominated by the \gamma_1 component for large k
The power method exploits this by iterating y_k=Ax_{k-1}, x_k=y_k/\Vert y_k\Vert from an initial vector x_0
Assume A has eigenvalues \lambda_1>|\lambda_2|\ge\cdots\ge|\lambda_n| and associated eigenvectors \gamma_1,\ldots,\gamma_n. Also assume the initial vector satisfies x_0=\sum_{i=1}^n\beta_i\gamma_i for some \beta_1\neq 0. Then
\left|\Vert y_{k}\Vert-\lambda_{1}\right|\le C\left(\frac{|\lambda_{2}|}{\lambda_{1}}\right)^{k},\qquad\Vert x_{k}-\tilde{\gamma}_{1}\Vert\le C\left(\frac{|\lambda_{2}|}{\lambda_{1}}\right)^{k}, where \tilde{\gamma_1}=\pm\gamma_1.
The power method only requires the matrix-vector multiplication v\rightarrow Av to compute eigenvalues of A (recall the CG method)
This is important and useful for sparse matrices
However, it does not effectively use the information of past iterations
Also, it can only compute one eigenvalue at a time
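As promised above, a short R sketch of the power method (the function name and defaults are our own); note that it touches A only through the product A %*% x.

# Power method: repeatedly multiply by A and normalize
power_method = function(A, maxit = 1000, tol = 1e-8) {
  x = rnorm(nrow(A))
  x = x / sqrt(sum(x^2))
  lambda = 0
  for (k in 1:maxit) {
    y = A %*% x
    lambda_new = sqrt(sum(y^2))          # ||y_k||, converges to |lambda_1|
    x = as.numeric(y / lambda_new)       # x_k = y_k / ||y_k||
    if (abs(lambda_new - lambda) < tol) break
    lambda = lambda_new
  }
  list(value = lambda_new, vector = x, niter = k)
}

set.seed(123)
M = matrix(rnorm(100), 10)
A = crossprod(M)                         # p.d., so lambda_1 = |lambda_1| > 0
c(power_method(A)$value, eigen(A, symmetric = TRUE)$values[1])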
Mainstream software packages typically use more advanced algorithms to compute the largest eigenvalues, e.g., the implicitly restarted Arnoldi/Lanczos methods [4] and the Jacobi–Davidson method [5]
They also only need the v\rightarrow Av operation
Can compute multiple eigenvalues together
These algorithms themselves are quite sophisticated, however
Another type of problem is to find the eigenvalue that is closest to a given number \mu
For example, if \mu=0, then this is equivalent to finding the smallest eigenvalue (in absolute value)
Fortunately, we can make use of existing methods that compute largest eigenvalues for this task, via a technique called spectral transformation
Spectral transformation is based on the following finding.
If \lambda_i, i=1,\ldots,n are eigenvalues of A, then Ax_i=\lambda_i x_i for the eigenvectors x_i. Choose a shift \mu\neq\lambda_i, then the eigenvalues of B=(A-\mu I_n)^{-1} are \theta_i=(\lambda_i-\mu)^{-1}, and B x_i=\theta_i x_i.
Therefore, the largest eigenvalue of B corresponds to the eigenvalue of A that is closest to \mu.
If we want to compute the eigenvalue of A closest to \mu
We can apply the power method (or other methods based on matrix-vector multiplication) to B=(A-\mu I_n)^{-1}
The iteration becomes \begin{align*} y_{k} & =(A-\mu I_{n})^{-1}x_{k-1}\\ x_{k} & =y_{k}/\Vert y_{k}\Vert \end{align*}
By the convergence theorem, we have \Vert y_k \Vert\rightarrow \theta_1
Then recover the eigenvalue of A by \lambda_\mu=\mu+1/\theta_1
In actual implementation, we do not directly compute the matrix inverse (A-\mu I_{n})^{-1}
Instead, we factorize A-\mu I_{n} once using LU decomposition or LDL^T decomposition
And then solve the linear system (A-\mu I_n)y_k=x_{k-1} in each iteration (a minimal R sketch is given below)
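A minimal R sketch of this shift-and-invert scheme, using a QR factorization in place of the LU/LDL^T factorization purely for simplicity; all names below are our own.

# Shift-and-invert power method: factorize (A - mu*I) once, then only solve
# linear systems inside the loop
closest_eigenvalue = function(A, mu, maxit = 1000, tol = 1e-10) {
  n = nrow(A)
  fac = qr(A - mu * diag(n))             # factorize once
  x = rnorm(n)
  x = x / sqrt(sum(x^2))
  theta = 0
  for (k in 1:maxit) {
    y = solve.qr(fac, x)                 # y_k = (A - mu*I)^{-1} x_{k-1}
    theta_new = sqrt(sum(y^2))           # ||y_k||, converges to |theta_1|
    x = as.numeric(y / theta_new)
    if (abs(theta_new - theta) < tol) break
    theta = theta_new
  }
  theta1 = sum(x * solve.qr(fac, x))     # Rayleigh quotient recovers the sign
  mu + 1 / theta1                        # eigenvalue of A closest to mu
}

set.seed(123)
M = matrix(rnorm(100), 10)
A = M + t(M)
ev = eigen(A, symmetric = TRUE, only.values = TRUE)$values
c(closest_eigenvalue(A, mu = 0), ev[which.min(abs(ev))])  # the two agree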
The eigen() function in base R is a very stable and efficient function to compute all eigenvalues
Specify symmetric = TRUE if you know your matrix is symmetric (by default it will detect symmetry)
Specify only.values = TRUE if you do not need eigenvectors
library(dplyr)
set.seed(123)
n = 1000
M = matrix(rnorm(n^2), n)
A = M + t(M)
bench::mark(
  eigen(A),
  eigen(A, symmetric = FALSE),
  eigen(A, symmetric = TRUE),
  eigen(A, symmetric = TRUE, only.values = TRUE),
  min_iterations = 3, max_iterations = 10, check = FALSE
) %>% select(expression, min, median)
## # A tibble: 4 × 3
##   expression                                          min   median
##   <bch:expr>                                     <bch:tm> <bch:tm>
## 1 eigen(A)                                          1.22s    1.22s
## 2 eigen(A, symmetric = FALSE)                        4.32s    4.34s
## 3 eigen(A, symmetric = TRUE)                         1.18s    1.19s
## 4 eigen(A, symmetric = TRUE, only.values = TRUE)  326.67ms 330.35ms
To compute only a few eigenvalues, we can use the eigs_sym() function in the RSpectra package; which = "LM" requests those with the largest magnitude
library(RSpectra)
e = eigs_sym(A, k = 3, which = "LM")
e$values
## [1] 88.83776 88.31500 -88.98945
head(e$vectors)
##             [,1]         [,2]         [,3]
## [1,]  0.01506291  0.029602478 -0.015855700
## [2,] -0.05124211  0.016404205 -0.014749483
## [3,] -0.01542738 -0.002350839  0.005520141
## [4,]  0.02921002 -0.006653817 -0.003387408
## [5,] -0.02406133  0.039802584  0.028103112
## [6,] -0.01309102 -0.026105929  0.073813795
which = "LA"
e = eigs_sym(A, k = 3, which = "LA")
e$values
## [1] 88.83776 88.31500 87.18646
If eigenvectors are not needed, set opts = list(retvec = FALSE)
e = eigs_sym(A, k = 3, which = "LA", opts = list(retvec = FALSE))
e$vectors
## NULL
When only a few eigenvalues are needed, eigs_sym() is much faster than eigen()
bench::mark(
  eigen(A, symmetric = TRUE),
  eigen(A, symmetric = TRUE, only.values = TRUE),
  eigs_sym(A, k = 3, which = "LM"),
  eigs_sym(A, k = 3, which = "LM", opts = list(retvec = FALSE)),
  min_iterations = 3, max_iterations = 10, check = FALSE
) %>% select(min, median)
## # A tibble: 4 × 2
##        min   median
##   <bch:tm> <bch:tm>
## 1     1.2s     1.2s
## 2  330.7ms  335.7ms
## 3  158.8ms  159.2ms
## 4  157.1ms  159.3ms
To find the eigenvalues closest to a given number, supply it as the sigma argument (the spectral transformation); here sigma = 0 gives the three eigenvalues of A smallest in absolute value
e = eigs_sym(A, k = 3, which = "LM", sigma = 0)
e$values
## [1] 0.01589015 -0.14248784 -0.26307129
In general, if sigma is not NULL, then the selection rule is applied to 1/(\lambda_i-\sigma)
This only affects the selection rule; the returned eigenvalues are still those of A
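As a quick check, a sketch reusing the matrix A from the benchmark above: with sigma = 1, the returned values match the eigenvalues of A closest to 1 obtained from a full eigen() computation.

# The selection rule acts on 1/(lambda - sigma), but the reported values
# are eigenvalues of A itself
ev = eigen(A, symmetric = TRUE, only.values = TRUE)$values
sort(ev[order(abs(ev - 1))[1:3]])                          # closest to 1, via eigen()
sort(eigs_sym(A, k = 3, which = "LM", sigma = 1)$values)   # same, via eigs_sym()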
[1] Folkmar Bornemann (2018). Numerical Linear Algebra. Springer.
[2] Grégoire Allaire and Sidi Mahmoud Kaber (2008). Numerical Linear Algebra. Springer.
[3] Åke Björck (2015). Numerical Methods in Matrix Computations. Springer.
[4] Danny C. Sorensen (1997). Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations. In Parallel Numerical Algorithms (pp. 119-165). Springer.
[5] Gerard L.G. Sleijpen and Henk A. Van der Vorst (1996). A Jacobi–Davidson iteration method for linear eigenvalue problems. SIAM Journal on Matrix Analysis and Applications, 17(2), 401-425.