
Computational Statistics

Lecture 4

Yixuan Qiu

2022-09-28

1 / 54

Numerical Linear Algebra

2 / 54

Today's Topics

  • Conjugate gradient method

  • Eigenvalue computation

3 / 54

Conjugate Gradient Method

4 / 54

Direct Methods

  • All the methods introduced so far can be categorized as direct methods to solve linear systems

  • Meaning that the exact solution to Ax=b can be computed within a finite number of operations (in exact arithmetic)

  • Cons: too expensive to handle very large linear systems

5 / 54

Iterative Methods

  • Computes a sequence of approximate solutions x(k) that converges to the exact solution

  • Stops when the precision is sufficient

  • Do not request more precision than is needed

  • In each iteration, the computation is typically cheap, e.g., the matrix-vector multiplication v ↦ Av

  • Especially efficient for sparse matrices

6 / 54

Conjugate Gradient Method

  • Conjugate gradient method (CG) is a very special linear system solver

  • It is a direct method used as an iterative one

  • Aims to solve the linear system Ax=b when A is positive definite (p.d.)

7 / 54

Conjugate Gradient Method

  • Only uses the matrix-vector multiplication v ↦ Av

  • Obtains the exact solution after n steps

  • Converges fast under some conditions

8 / 54

Algorithm [2]
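For reference, the core recursion of CG (which the R implementation later in these slides follows) starts from an initial guess x^{(0)}, sets r^{(0)} = b − Ax^{(0)} and p^{(0)} = r^{(0)}, and for k = 0, 1, … updates

$$\begin{aligned} \alpha_k &= \frac{r^{(k)\prime} r^{(k)}}{p^{(k)\prime} A p^{(k)}}, & x^{(k+1)} &= x^{(k)} + \alpha_k p^{(k)}, & r^{(k+1)} &= r^{(k)} - \alpha_k A p^{(k)}, \\ \beta_k &= \frac{r^{(k+1)\prime} r^{(k+1)}}{r^{(k)\prime} r^{(k)}}, & p^{(k+1)} &= r^{(k+1)} + \beta_k p^{(k)}, \end{aligned}$$

stopping once the residual norm ‖r^{(k+1)}‖ is small enough.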

9 / 54

Convergence [2]

In exact arithmetic, the algorithm converges to the solution of the linear system Ax=b in at most n iterations.

10 / 54

Convergence [2]

Let A be a p.d. matrix and denote by x the solution to Ax=b. Let x(k) be the sequence of approximate solutions produced by the conjugate gradient method. Then

$$\|x^{(k)} - x\|_2 \le 2\sqrt{\kappa}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|x^{(0)} - x\|_2,$$

where κ is the condition number of A (defined later).

11 / 54

Condition Number

  • For any matrix A, the condition number κ(A) is defined as κ(A)=μmax(A)/μmin(A), where μmax and μmin are the largest and smallest singular values, respectively

  • For a p.d. matrix A, κ(A)=λmax(A)/λmin(A), where λmax and λmin are the largest and smallest eigenvalues, respectively

  • κ(A) ≥ 1

  • For any orthogonal matrix Q, κ(Q)=1 and κ(QA)=κ(AQ)=κ(A)
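A quick numerical check of these properties in R (an illustrative snippet, not from the original slides; kappa(x, exact = TRUE) computes the exact 2-norm condition number):

set.seed(123)
S = crossprod(matrix(rnorm(25), 5))        # a small p.d. matrix
ev = eigen(S, symmetric = TRUE, only.values = TRUE)$values
sv = svd(S)$d
max(sv) / min(sv)                          # kappa(S) via singular values
max(ev) / min(ev)                          # same ratio via eigenvalues (S is p.d.)
Q = qr.Q(qr(matrix(rnorm(25), 5)))         # a random orthogonal matrix
kappa(Q, exact = TRUE)                     # equals 1
kappa(Q %*% S, exact = TRUE) / kappa(S, exact = TRUE)   # approximately 1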

12 / 54

Insights

  • We say a matrix A is well-conditioned if κ(A) is small (i.e., κ(A) ≈ 1)
  • Otherwise we say A is ill-conditioned (i.e., κ(A) ≫ 1)
  • If a p.d. matrix A has almost equal eigenvalues, then κ(A) ≈ 1 and (√κ − 1)/(√κ + 1) ≈ 0, so CG converges extremely fast

  • If A almost loses positive definiteness, λmin(A) ≈ 0, then κ(A) ≫ 1, and CG makes almost no progress

13 / 54

Implementation

  • A direct "translation" of the algorithm gives
cg = function(A, b, x0 = rep(0, length(b)), eps = 1e-6)
{
    m = length(b)
    x = x0
    # Initial residual and search direction
    p = r = b - A %*% x
    r2 = sum(r^2)
    errs = c()
    for(i in 1:m)
    {
        Ap = A %*% p
        # Step size along the current search direction
        alpha = r2 / sum(p * Ap)
        x = x + alpha * p
        # Update the residual
        r = r - alpha * Ap
        r2_new = sum(r^2)
        err = sqrt(r2_new)
        errs = c(errs, err)
        if(err < eps)
            break
        # New search direction, conjugate to the previous ones
        beta = r2_new / r2
        p = r + beta * p
        r2 = r2_new
    }
    list(x = x, errs = errs, niter = i)
}
14 / 54

Implementation

  • We test on a simulated matrix

  • However, the algorithm does not seem to converge, and the error is quite large

set.seed(123)
n = 100
M = matrix(0.1 * rnorm(n^2), n)
A = crossprod(M)
b = rnorm(n)
sol = cg(A, b, eps = 1e-12)
sol$niter
## [1] 100
max(abs(A %*% sol$x - b))
## [1] 0.5763024
15 / 54

Implementation

  • We can verify that A is ill-conditioned
kappa(A)
## [1] 1164091
  • Adding a positive number to the diagonal results in a much smaller condition number
kappa(A1 <- A + diag(rep(0.1, n)))
## [1] 134.8415
kappa(A2 <- A + diag(rep(1, n)))
## [1] 15.01074
16 / 54

Implementation

sol1 = cg(A1, b, eps = 1e-12)
sol1$niter
## [1] 78
max(abs(A1 %*% sol1$x - b))
## [1] 1.588243e-13
sol2 = cg(A2, b, eps = 1e-12)
sol2$niter
## [1] 29
max(abs(A2 %*% sol2$x - b))
## [1] 1.858513e-13
17 / 54

Implementation

  • Plot the residual norms (in log-scale) against iterations
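One way to produce such a plot from the errs component returned by cg() (an illustrative sketch; colors and labels are arbitrary):

plot(sol$errs, type = "l", log = "y",
     ylim = range(sol$errs, sol1$errs, sol2$errs),
     xlab = "Iteration", ylab = "Residual norm (log scale)")
lines(sol1$errs, col = "blue")
lines(sol2$errs, col = "red")
legend("topright", legend = c("A", "A1", "A2"),
       col = c("black", "blue", "red"), lty = 1)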

18 / 54

Implementation

  • We will see a more useful example when we consider linear regression on sparse matrices
19 / 54

When to Use

  • A is p.d. and well-conditioned

  • The matrix-vector operation v ↦ Av is cheap

  • One example is a sparse matrix, or a product of sparse matrices

  • We will elaborate on this point when we study sparse matrices

20 / 54

Eigenvalue Computation

21 / 54

Eigenvalues

  • Eigenvalues are important characteristics of matrices

  • Wide applications in statistics, e.g., principal component analysis (PCA), spectral clustering, matrix norm, random matrix theory, etc.

  • Eigenvalue computation is one of the core topics of numerical linear algebra

22 / 54

Definition

  • The definition of eigenvalue is actually quite simple

  • For a square matrix A, we call a scalar λ satisfying Ax=λx for some nonzero vector x an eigenvalue of A

  • The corresponding nonzero vector x is an eigenvector

  • Here we focus on real symmetric matrices

23 / 54

Decomposition Theorem [2]

A matrix An×n is real symmetric if and only if there exists a real orthogonal matrix Γ and real eigenvalues λ1, …, λn such that A = ΓDΓ', where D = diag(λ1, …, λn).

This is typically called the eigenvalue decomposition, or spectral decomposition, of A.

24 / 54

Basic Properties

Suppose that A is a real symmetric matrix

  • A is p.d. if and only if all its eigenvalues are strictly positive

  • Ak = ΓDkΓ', k = …, −2, −1, 0, 1, 2, …

  • det(A) = λ1⋯λn

  • tr(A) = λ1 + ⋯ + λn
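These identities are easy to verify numerically (an illustrative check, not from the original slides):

set.seed(123)
S = crossprod(matrix(rnorm(16), 4))      # a small symmetric p.d. matrix
lam = eigen(S, symmetric = TRUE, only.values = TRUE)$values
c(det(S), prod(lam))                     # determinant = product of eigenvalues
c(sum(diag(S)), sum(lam))                # trace = sum of eigenvalues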

25 / 54

Basic Properties

  • Min-Max principle:

$$\lambda_{\min}(A) = \min_{x \ne 0} \frac{x'Ax}{x'x} = \min_{\|x\|=1} x'Ax, \qquad \lambda_{\max}(A) = \max_{x \ne 0} \frac{x'Ax}{x'x} = \max_{\|x\|=1} x'Ax$$

  • This gives PCA a nice interpretation: the first principal component maximizes the explained variance
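As a small illustration of this interpretation (hypothetical code, not from the slides), the variance of the first principal component equals the largest eigenvalue of the sample covariance matrix:

set.seed(123)
X = matrix(rnorm(200 * 5), 200, 5)
S = cov(X)
e = eigen(S, symmetric = TRUE)
var(as.numeric(X %*% e$vectors[, 1]))    # variance of the first PC scores
e$values[1]                              # largest eigenvalue of cov(X)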

26 / 54

Computing Eigenvalues

There are different cases of computing eigenvalues

  • Computing all eigenvalues (eigenvalue decomposition)

  • Computing the largest eigenvalue

  • Computing the eigenvalue that is closest to a given number

  • ...

27 / 54

Computing Eigenvalues

  • Eigenvalue computation is a large field in numerical linear algebra

  • The actual algorithms used in practice are usually quite involved

  • For the purpose of statistical computing, we mainly introduce the basic ideas behind these algorithms and some recommended implementations

  • In what follows, we focus on finding eigenvalues of real symmetric matrices

28 / 54

All Eigenvalues

29 / 54

Computing All Eigenvalues

The most commonly-used method to compute all eigenvalues consists of two steps:

  1. Looking for an orthogonal matrix Q such that Q'AQ is tridiagonal

$$Q'AQ = T = \begin{bmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_{n-1} & \alpha_{n-1} & \beta_n \\ & & & \beta_n & \alpha_n \end{bmatrix}$$

  2. Applying the QR algorithm (introduced later) to the matrix T to compute all eigenvalues
30 / 54

The Householder (Tridiagonalization) Algorithm

  • The Householder algorithm finds a sequence of (simple) orthogonal matrices Qk such that T = (Q1⋯Qn−2)'A(Q1⋯Qn−2), where T is tridiagonal

  • Q = Q1⋯Qn−2 is also orthogonal (why?)

  • A and T have the same eigenvalues (why?)

  • The motivation is that matrix operations on T are much cheaper

31 / 54

The Householder (Tridiagonalization) Algorithm
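A minimal, unoptimized R sketch of the idea (illustration only; the function name and the tolerance are ad hoc, and production implementations such as LAPACK's are far more careful and efficient):

householder_tridiag = function(A)
{
    n = nrow(A)
    for(k in seq_len(n - 2))
    {
        x = A[(k + 1):n, k]
        # Householder vector v such that (I - 2vv')x is a multiple of e1
        s = if(x[1] >= 0) 1 else -1
        v = x
        v[1] = v[1] + s * sqrt(sum(x^2))
        vnorm = sqrt(sum(v^2))
        if(vnorm < 1e-12)
            next
        v = v / vnorm
        # Apply the reflection H = I - 2vv' from the left and the right
        A[(k + 1):n, k:n] = A[(k + 1):n, k:n] - 2 * v %*% (t(v) %*% A[(k + 1):n, k:n])
        A[k:n, (k + 1):n] = A[k:n, (k + 1):n] - 2 * (A[k:n, (k + 1):n] %*% v) %*% t(v)
    }
    A
}

set.seed(123)
S = crossprod(matrix(rnorm(36), 6))
T6 = householder_tridiag(S)
round(T6, 8)                                        # (numerically) tridiagonal
max(abs(sort(eigen(T6, symmetric = TRUE)$values) -
        sort(eigen(S, symmetric = TRUE)$values)))   # same eigenvalues as S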

32 / 54

The QR Algorithm

The QR algorithm iteratively reduces the tridiagonal T into a diagonal matrix by orthogonal transformations:

  1. Initialize T(0) = T, and begin the loop k = 0, 1, …

  2. Compute the QR decomposition of the matrix T(k): T(k) = QkRk

  3. Update T(k+1) = RkQk = Qk'T(k)Qk

Assume that |λ1| > ⋯ > |λn| > 0, and then T(k) converges to a diagonal matrix, whose diagonal elements are eigenvalues of T and A.
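A minimal R sketch of this basic (unshifted) iteration, applied to a small symmetric matrix with well-separated eigenvalues (illustration only; practical implementations add the shifts and deflation described on the next slide):

qr_iteration = function(A, maxit = 1000, tol = 1e-10)
{
    Tk = A
    for(k in 1:maxit)
    {
        dec = qr(Tk)
        # T(k+1) = R_k Q_k is orthogonally similar to T(k)
        Tk = qr.R(dec) %*% qr.Q(dec)
        # Stop when the off-diagonal part is negligible
        if(max(abs(Tk - diag(diag(Tk)))) < tol)
            break
    }
    sort(diag(Tk), decreasing = TRUE)
}

set.seed(123)
Q = qr.Q(qr(matrix(rnorm(36), 6)))            # a random orthogonal matrix
A = Q %*% diag(c(10, 7, 5, 3, 2, 1)) %*% t(Q)
qr_iteration(A)                               # recovers 10, 7, 5, 3, 2, 1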

33 / 54

Practical Algorithm

  • The lower triangular elements T(k)ij, i > j, converge to zero at the rate of O(|λi/λj|) [2]
  • To accelerate the convergence, a shift can be applied in each iteration:
  1. Compute the QR decomposition of the shifted matrix, T(k) − σkIn = QkRk
  2. Update T(k+1) = RkQk + σkIn = Qk'T(k)Qk
  • We can take σk = (T(k))nn until (T(k))n−1,n ≈ 0, and then switch to σk = (T(k))n−1,n−1
34 / 54

Complexity

  • The Householder algorithm costs O(n3)

  • The QR decomposition of tridiagonal matrices has a specialized algorithm, which costs O(n2) instead of O(n3) (the latter for general dense matrices)

  • It can be proved that all T(k) matrices are tridiagonal

  • The number of iterations required in the QR algorithm depends on the distribution of eigenvalues

35 / 54

Largest Eigenvalue

36 / 54

Computing the Largest Eigenvalue

  • In many applications, we only need the largest eigenvalue (in absolute value)

  • It can be extended to computing the largest k eigenvalues

  • If λ1, …, λn are eigenvalues of A with |λ1| > ⋯ > |λn|, and γ1, …, γn are the associated eigenvectors, then:

    • λ2 is the largest eigenvalue of A − λ1γ1γ1'
    • λ3 is the largest eigenvalue of A − λ1γ1γ1' − λ2γ2γ2'
    • ...
37 / 54

Power Method [2]

One widely-used and possibly the simplest method to compute the largest eigenvalue (in absolute value) is the power method
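A minimal R sketch of the power method (illustration only; the stopping rule is ad hoc, and it assumes the dominant eigenvalue is well separated in magnitude and the starting vector has a nonzero component along γ1):

power_method = function(A, x0 = rnorm(nrow(A)), maxit = 1000, tol = 1e-10)
{
    x = x0 / sqrt(sum(x0^2))
    y_old = Inf
    for(k in 1:maxit)
    {
        Ax = as.numeric(A %*% x)
        y = sum(x * Ax)                  # Rayleigh quotient: eigenvalue estimate
        x = Ax / sqrt(sum(Ax^2))
        if(abs(y - y_old) < tol * abs(y))
            break
        y_old = y
    }
    list(value = y, vector = x, niter = k)
}

set.seed(123)
Q = qr.Q(qr(matrix(rnorm(100^2), 100)))
A = Q %*% diag(c(10, runif(99, 0, 5))) %*% t(Q)   # dominant eigenvalue is 10
power_method(A)$value                             # close to 10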

38 / 54

Convergence [2]

Assume A has eigenvalues λ1 > |λ2| ≥ ⋯ ≥ |λn| and associated eigenvectors γ1, …, γn. Also assume the initial vector satisfies x0 = β1γ1 + ⋯ + βnγn for some β1 ≠ 0. Then

$$|y_k - \lambda_1| \le C\left(\frac{|\lambda_2|}{\lambda_1}\right)^k, \qquad \|x_k - \tilde{\gamma}_1\| \le C\left(\frac{|\lambda_2|}{\lambda_1}\right)^k,$$

where γ̃1 = ±γ1.

39 / 54

Pros and Cons

  • The power method only requires the matrix-vector multiplication v ↦ Av to compute eigenvalues of A (recall the CG method)

  • This is important and useful for sparse matrices

  • However, it does not effectively use the information from past iterations

  • Also, it can only compute one eigenvalue at a time

40 / 54

Advanced Algorithms

  • Mainstream software packages typically use other advanced algorithms to compute the largest eigenvalues, e.g.,

    • Implicitly restarted Lanczos method [4]
    • Jacobi–Davidson algorithm [5]
  • They also only need the v ↦ Av operation

  • Can compute multiple eigenvalues together

  • These algorithms themselves are quite sophisticated, however

41 / 54

Eigenvalue Closest to μ

42 / 54

Computing Eigenvalue Closest to μ

  • Another type of problem is to find the eigenvalue that is closest to a given number μ

  • For example, if μ=0, then this is equivalent to finding the smallest eigenvalue (in absolute value)

  • Fortunately, we can make use of existing methods that compute largest eigenvalues for this task, via a technique called spectral transformation

43 / 54

Spectral Transformation

Spectral transformation is based on the following finding.

If λi, i = 1, …, n, are the eigenvalues of A, then Axi = λixi for the eigenvectors xi. Choose a shift μ ≠ λi; then the eigenvalues of B = (A − μIn)⁻¹ are θi = (λi − μ)⁻¹, and Bxi = θixi.

Therefore, the largest eigenvalue of B corresponds to the eigenvalue of A that is closest to μ.
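A quick numerical check of this fact (illustrative code; the matrix and the shift μ = 0.5 are arbitrary):

set.seed(123)
S = crossprod(matrix(rnorm(25), 5))
mu = 0.5
lam = eigen(S, symmetric = TRUE, only.values = TRUE)$values
B = solve(S - mu * diag(5))
theta = eigen(B, symmetric = TRUE, only.values = TRUE)$values
max(abs(sort(theta) - sort(1 / (lam - mu))))   # ~ 0 up to rounding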

44 / 54

Inverse Iteration

  • If we want to compute the eigenvalue of A closest to μ

  • We can apply the power method (or other methods based on matrix-vector multiplication) to B = (A − μIn)⁻¹

  • The iteration becomes

$$y_k = (A - \mu I_n)^{-1} x_{k-1}, \qquad x_k = y_k / \|y_k\|$$

  • By the convergence theorem, the corresponding eigenvalue estimate converges to θ1

  • Then recover the eigenvalue of A closest to μ by λ = μ + 1/θ1

45 / 54

Solving Linear Systems

  • In actual implementation, we do not directly compute the matrix inverse (A − μIn)⁻¹

  • Instead, we factorize A − μIn once using the LU or LDL' decomposition

  • And then solve the linear systems (A − μIn)yk = xk−1 in each iteration, as sketched below
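A minimal R sketch combining the two slides above (illustration only; a QR factorization is used here for simplicity in place of the LU/LDL' factorization mentioned above, and the stopping rule is ad hoc):

inverse_iteration = function(A, mu, maxit = 1000, tol = 1e-10)
{
    n = nrow(A)
    fac = qr(A - mu * diag(n))           # factorize the shifted matrix once
    x = rnorm(n)
    x = x / sqrt(sum(x^2))
    theta_old = Inf
    for(k in 1:maxit)
    {
        y = qr.solve(fac, x)             # solve (A - mu*I) y = x
        theta = sum(x * y)               # estimate of theta1 = 1 / (lambda - mu)
        x = y / sqrt(sum(y^2))
        if(abs(theta - theta_old) < tol * abs(theta))
            break
        theta_old = theta
    }
    list(value = mu + 1 / theta, niter = k)
}

set.seed(123)
Q = qr.Q(qr(matrix(rnorm(100^2), 100)))
A = Q %*% diag(1:100) %*% t(Q)           # eigenvalues 1, 2, ..., 100
inverse_iteration(A, mu = 3.2)$value     # eigenvalue closest to 3.2, i.e., 3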

46 / 54

Implementation

  • The eigen() function in base R is a very stable and efficient function to compute all eigenvalues

  • Specify symmetric = TRUE if you know your matrix is symmetric (by default it will detect symmetry)

  • Specify only.values = TRUE if you do not need eigenvectors

47 / 54

Benchmark

library(dplyr)
set.seed(123)
n = 1000
M = matrix(rnorm(n^2), n)
A = M + t(M)
bench::mark(
    eigen(A),
    eigen(A, symmetric = FALSE),
    eigen(A, symmetric = TRUE),
    eigen(A, symmetric = TRUE, only.values = TRUE),
    min_iterations = 3, max_iterations = 10, check = FALSE
) %>% select(expression, min, median)
## # A tibble: 4 × 3
## expression min median
## <bch:expr> <bch:tm> <bch:tm>
## 1 eigen(A) 1.15s 1.17s
## 2 eigen(A, symmetric = FALSE) 4.13s 4.14s
## 3 eigen(A, symmetric = TRUE) 1.11s 1.13s
## 4 eigen(A, symmetric = TRUE, only.values = TRUE) 314.12ms 314.19ms
48 / 54

Implementation

  • To compute a subset of eigenvalues up to some selection rule, we have several options
  • The RSpectra package implements the implicitly restarted Lanczos method
  • The PRIMME package uses the Jacobi-Davidson algorithm
  • The irlba package is designed for singular value decomposition (next class), but can also be used to compute the largest eigenvalues
  • My personal favorite is the RSpectra package (this is a highly biased choice, of course)
49 / 54

Implementation

  • The code below computes three eigenvalues with largest absolute values
library(RSpectra)
e = eigs_sym(A, k = 3, which = "LM")
e$values
## [1] 88.83776 88.31500 -88.98945
head(e$vectors)
## [,1] [,2] [,3]
## [1,] 0.01506291 0.029602478 -0.015855700
## [2,] -0.05124211 0.016404205 -0.014749483
## [3,] -0.01542738 -0.002350839 0.005520141
## [4,] 0.02921002 -0.006653817 -0.003387408
## [5,] -0.02406133 0.039802584 0.028103112
## [6,] -0.01309102 -0.026105929 0.073813795
50 / 54

Implementation

  • If you want the largest three eigenvalues without taking absolute values, use which = "LA"
e = eigs_sym(A, k = 3, which = "LA")
e$values
## [1] 88.83776 88.31500 87.18646
  • If no eigenvectors are needed, add the following option
e = eigs_sym(A, k = 3, which = "LA", opts = list(retvec = FALSE))
e$vectors
## NULL
51 / 54

Benchmark

  • If only part of the eigenvalues are needed, eigs_sym() is much faster than eigen()
bench::mark(
    eigen(A, symmetric = TRUE),
    eigen(A, symmetric = TRUE, only.values = TRUE),
    eigs_sym(A, k = 3, which = "LM"),
    eigs_sym(A, k = 3, which = "LM", opts = list(retvec = FALSE)),
    min_iterations = 3, max_iterations = 10, check = FALSE
) %>% select(min, median)
## # A tibble: 4 × 2
## min median
## <bch:tm> <bch:tm>
## 1 1.12s 1.12s
## 2 310.66ms 310.66ms
## 3 149.93ms 150.31ms
## 4 149.43ms 150.85ms
  • The difference will be more evident on sparse matrices (next class)
52 / 54

Implementation

  • If one needs the smallest three eigenvalues (in absolute value)
e = eigs_sym(A, k = 3, which = "LM", sigma = 0)
e$values
## [1] 0.01589015 -0.14248784 -0.26307129
  • In general, if sigma is not NULL, then the selection rule is applied to 1/(λi − σ)

  • This only affects the selection rule

  • The returned eigenvalues are still for A

53 / 54

References

[1] Folkmar Bornemann (2018). Numerical Linear Algebra. Springer.

[2] Grégoire Allaire and Sidi Mahmoud Kaber (2008). Numerical linear algebra. Springer.

[3] Åke Björck (2015). Numerical methods in matrix computations. Springer.

[4] Danny C. Sorensen (1997). Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations. In Parallel Numerical Algorithms (pp. 119-165). Springer.

[5] Gerard L.G. Sleijpen and Henk A. Van der Vorst (1996). A Jacobi–Davidson iteration method for linear eigenvalue problems. SIAM Journal on Matrix Analysis and Applications, 17(2), 401-425.

54 / 54
