
Computational Statistics

Lecture 4

Yixuan Qiu

2022-09-28

1 / 54

Numerical Linear Algebra

2 / 54

Today's Topics

  • Conjugate gradient method

  • Eigenvalue computation

3 / 54

Conjugate Gradient Method

4 / 54

Direct Methods

  • All the methods introduced so far can be categorized as direct methods to solve linear systems

  • Meaning that the exact solution to Ax=b can be computed within a finite number of operations (in exact arithmetic)

  • Cons: too expensive to handle very large linear systems

5 / 54

Iterative Methods

  • Computes a sequence of approximate solutions x(k) that converges to the exact solution

  • Stops when the precision is sufficient

  • Do not request more precision than is needed

  • In each iteration, the computation is typically cheap, e.g., the matrix-vector multiplication v ↦ Av

  • Especially efficient for sparse matrices

6 / 54

Conjugate Gradient Method

  • Conjugate gradient method (CG) is a very special linear system solver

  • It is a direct method used as an iterative one

  • Aims to solve the linear system Ax=b when A is positive definite (p.d.)

7 / 54

Conjugate Gradient Method

  • Only uses the matrix-vector multiplication v ↦ Av

  • Obtains the exact solution after n steps

  • Converges fast under some conditions

8 / 54

Algorithm [2]
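For reference, the core recursion of CG (which the R implementation later in these slides follows) starts from an initial guess x^{(0)}, sets r^{(0)} = b − Ax^{(0)} and p^{(0)} = r^{(0)}, and for k = 0, 1, … updates

$$\begin{aligned} \alpha_k &= \frac{r^{(k)\prime} r^{(k)}}{p^{(k)\prime} A p^{(k)}}, & x^{(k+1)} &= x^{(k)} + \alpha_k p^{(k)}, & r^{(k+1)} &= r^{(k)} - \alpha_k A p^{(k)}, \\ \beta_k &= \frac{r^{(k+1)\prime} r^{(k+1)}}{r^{(k)\prime} r^{(k)}}, & p^{(k+1)} &= r^{(k+1)} + \beta_k p^{(k)}, \end{aligned}$$

stopping once the residual norm ‖r^{(k+1)}‖ is small enough.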

9 / 54

Convergence [2]

In exact arithmetic, the algorithm converges to the solution of the linear system Ax=b in at most n iterations.

10 / 54

Convergence [2]

Let A be a p.d. matrix and denote by x the solution to Ax=b. Let x(k) be the sequence of approximate solutions produced by the conjugate gradient method. Then

$$\|x^{(k)} - x\|_2 \le 2\sqrt{\kappa}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|x^{(0)} - x\|_2,$$

where κ is the condition number of A (defined later).

11 / 54

Condition Number

  • For any matrix A, the condition number κ(A) is defined as κ(A)=μmax(A)/μmin(A), where μmax and μmin are the largest and smallest singular values, respectively

  • For a p.d. matrix A, κ(A)=λmax(A)/λmin(A), where λmax and λmin are the largest and smallest eigenvalues, respectively

  • κ(A) ≥ 1

  • For any orthogonal matrix Q, κ(Q)=1 and κ(QA)=κ(AQ)=κ(A)
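A quick numerical check of these properties in R (an illustrative snippet, not from the original slides; kappa(x, exact = TRUE) computes the exact 2-norm condition number):

set.seed(123)
S = crossprod(matrix(rnorm(25), 5))        # a small p.d. matrix
ev = eigen(S, symmetric = TRUE, only.values = TRUE)$values
sv = svd(S)$d
max(sv) / min(sv)                          # kappa(S) via singular values
max(ev) / min(ev)                          # same ratio via eigenvalues (S is p.d.)
Q = qr.Q(qr(matrix(rnorm(25), 5)))         # a random orthogonal matrix
kappa(Q, exact = TRUE)                     # equals 1
kappa(Q %*% S, exact = TRUE) / kappa(S, exact = TRUE)   # approximately 1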

12 / 54

Insights

  • We say a matrix A is well-conditioned if κ(A) is small (i.e., κ(A) ≈ 1)
  • Otherwise we say A is ill-conditioned (i.e., κ(A) ≫ 1)
  • If a p.d. matrix A has almost equal eigenvalues, then κ(A) ≈ 1 and (√κ − 1)/(√κ + 1) ≈ 0, so CG converges extremely fast

  • If A almost loses positive definiteness, λmin(A) ≈ 0, then κ(A) ≫ 1, and CG makes almost no progress

13 / 54

Implementation

  • A direct "translation" of the algorithm gives
cg = function(A, b, x0 = rep(0, length(b)), eps = 1e-6)
{
    m = length(b)
    x = x0
    # Initial residual and search direction
    p = r = b - A %*% x
    r2 = sum(r^2)
    errs = c()
    for(i in 1:m)
    {
        Ap = A %*% p
        # Step size along the current search direction
        alpha = r2 / sum(p * Ap)
        x = x + alpha * p
        # Update the residual
        r = r - alpha * Ap
        r2_new = sum(r^2)
        err = sqrt(r2_new)
        errs = c(errs, err)
        if(err < eps)
            break
        # New search direction, conjugate to the previous ones
        beta = r2_new / r2
        p = r + beta * p
        r2 = r2_new
    }
    list(x = x, errs = errs, niter = i)
}
14 / 54

Implementation

  • We test on a simulated matrix

  • However, the algorithm does not seem to converge, and the error is quite large

set.seed(123)
n = 100
M = matrix(0.1 * rnorm(n^2), n)
A = crossprod(M)
b = rnorm(n)
sol = cg(A, b, eps = 1e-12)
sol$niter
## [1] 100
max(abs(A %*% sol$x - b))
## [1] 0.5763024
15 / 54

Implementation

  • We can verify that A is ill-conditioned
kappa(A)
## [1] 1164091
  • Adding a positive number to the diagonal results in a much smaller condition number
kappa(A1 <- A + diag(rep(0.1, n)))
## [1] 134.8415
kappa(A2 <- A + diag(rep(1, n)))
## [1] 15.01074
16 / 54

Implementation

sol1 = cg(A1, b, eps = 1e-12)
sol1$niter
## [1] 78
max(abs(A1 %*% sol1$x - b))
## [1] 1.588243e-13
sol2 = cg(A2, b, eps = 1e-12)
sol2$niter
## [1] 29
max(abs(A2 %*% sol2$x - b))
## [1] 1.858513e-13
17 / 54

Implementation

  • Plot the residual norms (in log-scale) against iterations
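One way to produce such a plot from the errs component returned by cg() (an illustrative sketch; colors and labels are arbitrary):

plot(sol$errs, type = "l", log = "y",
     ylim = range(sol$errs, sol1$errs, sol2$errs),
     xlab = "Iteration", ylab = "Residual norm (log scale)")
lines(sol1$errs, col = "blue")
lines(sol2$errs, col = "red")
legend("topright", legend = c("A", "A1", "A2"),
       col = c("black", "blue", "red"), lty = 1)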

18 / 54

Implementation

  • We will see a more useful example when we consider linear regression on sparse matrices
19 / 54

When to Use

  • A is p.d. and well-conditioned

  • The matrix-vector operation v ↦ Av is cheap

  • One example is a sparse matrix, or a product of sparse matrices

  • We will elaborate on this point when we study sparse matrices

20 / 54

Eigenvalue Computation

21 / 54

Eigenvalues

  • Eigenvalues are important characteristics of matrices

  • Wide applications in statistics, e.g., principal component analysis (PCA), spectral clustering, matrix norm, random matrix theory, etc.

  • Eigenvalue computation is one of the core topics of numerical linear algebra

22 / 54

Definition

  • The definition of eigenvalue is actually quite simple

  • For a square matrix A, we call a scalar λ satisfying Ax=λx for some nonzero vector x an eigenvalue of A

  • The corresponding nonzero vector x is an eigenvector

  • Here we focus on real symmetric matrices

23 / 54

Decomposition Theorem [2]

A matrix An×n is real symmetric if and only if there exists a real orthogonal matrix Γ and real eigenvalues λ1, …, λn such that A = ΓDΓ', where D = diag(λ1, …, λn).

This is typically called the eigenvalue decomposition, or spectral decomposition, of A.

24 / 54

Basic Properties

Suppose that A is a real symmetric matrix

  • A is p.d. if and only if all its eigenvalues are strictly positive

  • Ak = ΓDkΓ', k = …, −2, −1, 0, 1, 2, …

  • det(A) = λ1⋯λn

  • tr(A) = λ1 + ⋯ + λn
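These identities are easy to verify numerically (an illustrative check, not from the original slides):

set.seed(123)
S = crossprod(matrix(rnorm(16), 4))      # a small symmetric p.d. matrix
lam = eigen(S, symmetric = TRUE, only.values = TRUE)$values
c(det(S), prod(lam))                     # determinant = product of eigenvalues
c(sum(diag(S)), sum(lam))                # trace = sum of eigenvalues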

25 / 54

Basic Properties

  • Min-Max principle:

$$\lambda_{\min}(A) = \min_{x \ne 0} \frac{x'Ax}{x'x} = \min_{\|x\|=1} x'Ax, \qquad \lambda_{\max}(A) = \max_{x \ne 0} \frac{x'Ax}{x'x} = \max_{\|x\|=1} x'Ax$$

  • This gives PCA a nice interpretation: the first principal component maximizes the explained variance
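As a small illustration of this interpretation (hypothetical code, not from the slides), the variance of the first principal component equals the largest eigenvalue of the sample covariance matrix:

set.seed(123)
X = matrix(rnorm(200 * 5), 200, 5)
S = cov(X)
e = eigen(S, symmetric = TRUE)
var(as.numeric(X %*% e$vectors[, 1]))    # variance of the first PC scores
e$values[1]                              # largest eigenvalue of cov(X)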

26 / 54

Computing Eigenvalues

There are different cases of computing eigenvalues

  • Computing all eigenvalues (eigenvalue decomposition)

  • Computing the largest eigenvalue

  • Computing the eigenvalue that is closest to a given number

  • ...

27 / 54

Computing Eigenvalues

  • Eigenvalue computation is a large field in numerical linear algebra

  • The actual algorithms used in practice are usually quite involved

  • For the purpose of statistical computing, we mainly introduce the basic ideas behind these algorithms and some recommended implementations

  • In what follows, we focus on finding eigenvalues of real symmetric matrices

28 / 54

All Eigenvalues

29 / 54

Computing All Eigenvalues

The most commonly-used method to compute all eigenvalues consists of two steps:

  1. Looking for an orthogonal matrix Q such that Q'AQ is tridiagonal

$$Q'AQ = T = \begin{bmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_{n-1} & \alpha_{n-1} & \beta_n \\ & & & \beta_n & \alpha_n \end{bmatrix}$$

  2. Applying the QR algorithm (introduced later) to the matrix T to compute all eigenvalues
30 / 54

The Householder (Tridiagonalization) Algorithm

  • The Householder algorithm finds a sequence of (simple) orthogonal matrices Qk such that T = (Q1⋯Qn−2)'A(Q1⋯Qn−2), where T is tridiagonal

  • Q = Q1⋯Qn−2 is also orthogonal (why?)

  • A and T have the same eigenvalues (why?)

  • The motivation is that matrix operations on T are much cheaper

31 / 54

The Householder (Tridiagonalization) Algorithm
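A minimal, unoptimized R sketch of the idea (illustration only; the function name and the tolerance are ad hoc, and production implementations such as LAPACK's are far more careful and efficient):

householder_tridiag = function(A)
{
    n = nrow(A)
    for(k in seq_len(n - 2))
    {
        x = A[(k + 1):n, k]
        # Householder vector v such that (I - 2vv')x is a multiple of e1
        s = if(x[1] >= 0) 1 else -1
        v = x
        v[1] = v[1] + s * sqrt(sum(x^2))
        vnorm = sqrt(sum(v^2))
        if(vnorm < 1e-12)
            next
        v = v / vnorm
        # Apply the reflection H = I - 2vv' from the left and the right
        A[(k + 1):n, k:n] = A[(k + 1):n, k:n] - 2 * v %*% (t(v) %*% A[(k + 1):n, k:n])
        A[k:n, (k + 1):n] = A[k:n, (k + 1):n] - 2 * (A[k:n, (k + 1):n] %*% v) %*% t(v)
    }
    A
}

set.seed(123)
S = crossprod(matrix(rnorm(36), 6))
T6 = householder_tridiag(S)
round(T6, 8)                                        # (numerically) tridiagonal
max(abs(sort(eigen(T6, symmetric = TRUE)$values) -
        sort(eigen(S, symmetric = TRUE)$values)))   # same eigenvalues as S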

32 / 54

The QR Algorithm

The QR algorithm iteratively reduces the tridiagonal T into a diagonal matrix by orthogonal transformations:

  1. Initialize T(0) = T, and begin the loop k = 0, 1, …

  2. Compute the QR decomposition of the matrix T(k): T(k) = QkRk

  3. Update T(k+1) = RkQk = Qk'T(k)Qk

Assume that |λ1| > ⋯ > |λn| > 0, and then T(k) converges to a diagonal matrix, whose diagonal elements are eigenvalues of T and A.
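A minimal R sketch of this basic (unshifted) iteration, applied to a small symmetric matrix with well-separated eigenvalues (illustration only; practical implementations add the shifts and deflation described on the next slide):

qr_iteration = function(A, maxit = 1000, tol = 1e-10)
{
    Tk = A
    for(k in 1:maxit)
    {
        dec = qr(Tk)
        # T(k+1) = R_k Q_k is orthogonally similar to T(k)
        Tk = qr.R(dec) %*% qr.Q(dec)
        # Stop when the off-diagonal part is negligible
        if(max(abs(Tk - diag(diag(Tk)))) < tol)
            break
    }
    sort(diag(Tk), decreasing = TRUE)
}

set.seed(123)
Q = qr.Q(qr(matrix(rnorm(36), 6)))            # a random orthogonal matrix
A = Q %*% diag(c(10, 7, 5, 3, 2, 1)) %*% t(Q)
qr_iteration(A)                               # recovers 10, 7, 5, 3, 2, 1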

33 / 54

Practical Algorithm

  • The lower triangular elements T(k)ij, i > j, converge to zero at the rate of O(|λi/λj|) [2]
  • To accelerate the convergence, a shift can be applied in each iteration:
  1. Compute the QR decomposition of the shifted matrix, T(k) − σkIn = QkRk
  2. Update T(k+1) = RkQk + σkIn = Qk'T(k)Qk
  • We can take σk = (T(k))nn until (T(k))n−1,n ≈ 0, and then switch to σk = (T(k))n−1,n−1
34 / 54

Complexity

  • The Householder algorithm costs O(n3)

  • The QR decomposition of tridiagonal matrices has a specialized algorithm, which costs O(n2) instead of O(n3) (the latter for general dense matrices)

  • It can be proved that all T(k) matrices are tridiagonal

  • The number of iterations required in the QR algorithm depends on the distribution of eigenvalues

35 / 54

Largest Eigenvalue

36 / 54

Computing the Largest Eigenvalue

  • In many applications, we only need the largest eigenvalue (in absolute value)

  • It can be extended to computing the largest k eigenvalues

  • If λ1, …, λn are eigenvalues of A with |λ1| > ⋯ > |λn|, and γ1, …, γn are the associated eigenvectors, then:

    • λ2 is the largest eigenvalue of A − λ1γ1γ1'
    • λ3 is the largest eigenvalue of A − λ1γ1γ1' − λ2γ2γ2'
    • ...
37 / 54

Power Method [2]

One widely-used and possibly the simplest method to compute the largest eigenvalue (in absolute value) is the power method
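A minimal R sketch of the power method (illustration only; the stopping rule is ad hoc, and it assumes the dominant eigenvalue is well separated in magnitude and the starting vector has a nonzero component along γ1):

power_method = function(A, x0 = rnorm(nrow(A)), maxit = 1000, tol = 1e-10)
{
    x = x0 / sqrt(sum(x0^2))
    y_old = Inf
    for(k in 1:maxit)
    {
        Ax = as.numeric(A %*% x)
        y = sum(x * Ax)                  # Rayleigh quotient: eigenvalue estimate
        x = Ax / sqrt(sum(Ax^2))
        if(abs(y - y_old) < tol * abs(y))
            break
        y_old = y
    }
    list(value = y, vector = x, niter = k)
}

set.seed(123)
Q = qr.Q(qr(matrix(rnorm(100^2), 100)))
A = Q %*% diag(c(10, runif(99, 0, 5))) %*% t(Q)   # dominant eigenvalue is 10
power_method(A)$value                             # close to 10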

38 / 54

Convergence [2]

Assume A has eigenvalues λ1 > |λ2| ≥ ⋯ ≥ |λn| and associated eigenvectors γ1, …, γn. Also assume the initial vector satisfies x0 = β1γ1 + ⋯ + βnγn for some β1 ≠ 0. Then

$$|y_k - \lambda_1| \le C\left(\frac{|\lambda_2|}{\lambda_1}\right)^k, \qquad \|x_k - \tilde{\gamma}_1\| \le C\left(\frac{|\lambda_2|}{\lambda_1}\right)^k,$$

where γ̃1 = ±γ1.

39 / 54

Pros and Cons

  • The power method only requires the matrix-vector multiplication v ↦ Av to compute eigenvalues of A (recall the CG method)

  • This is important and useful for sparse matrices

  • However, it does not effectively use the information from past iterations

  • Also, it can only compute one eigenvalue at a time

40 / 54

Advanced Algorithms

  • Mainstream software packages typically use other advanced algorithms to compute the largest eigenvalues, e.g.,

    • Implicitly restarted Lanczos method [4]
    • Jacobi–Davidson algorithm [5]
  • They also only need the v ↦ Av operation

  • Can compute multiple eigenvalues together

  • These algorithms themselves are quite sophisticated, however

41 / 54

Eigenvalue Closest to μ

42 / 54

Computing Eigenvalue Closest to μ

  • Another type of problem is to find the eigenvalue that is closest to a given number μ

  • For example, if μ=0, then this is equivalent to finding the smallest eigenvalue (in absolute value)

  • Fortunately, we can make use of existing methods that compute largest eigenvalues for this task, via a technique called spectral transformation

43 / 54

Spectral Transformation

Spectral transformation is based on the following finding.

If λi, i = 1, …, n, are the eigenvalues of A, then Axi = λixi for the eigenvectors xi. Choose a shift μ ≠ λi; then the eigenvalues of B = (A − μIn)⁻¹ are θi = (λi − μ)⁻¹, and Bxi = θixi.

Therefore, the largest eigenvalue of B corresponds to the eigenvalue of A that is closest to μ.
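A quick numerical check of this fact (illustrative code; the matrix and the shift μ = 0.5 are arbitrary):

set.seed(123)
S = crossprod(matrix(rnorm(25), 5))
mu = 0.5
lam = eigen(S, symmetric = TRUE, only.values = TRUE)$values
B = solve(S - mu * diag(5))
theta = eigen(B, symmetric = TRUE, only.values = TRUE)$values
max(abs(sort(theta) - sort(1 / (lam - mu))))   # ~ 0 up to rounding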

44 / 54

Inverse Iteration

  • If we want to compute the eigenvalue of A closest to μ

  • We can apply the power method (or other methods based on matrix-vector multiplication) to B = (A − μIn)⁻¹

  • The iteration becomes

$$y_k = (A - \mu I_n)^{-1} x_{k-1}, \qquad x_k = y_k / \|y_k\|$$

  • By the convergence theorem, the corresponding eigenvalue estimate converges to θ1

  • Then recover the eigenvalue of A closest to μ by λ = μ + 1/θ1

45 / 54

Solving Linear Systems

  • In actual implementation, we do not directly compute the matrix inverse (A − μIn)⁻¹

  • Instead, we factorize A − μIn once using the LU or LDL' decomposition

  • And then solve the linear systems (A − μIn)yk = xk−1 in each iteration, as sketched below
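A minimal R sketch combining the two slides above (illustration only; a QR factorization is used here for simplicity in place of the LU/LDL' factorization mentioned above, and the stopping rule is ad hoc):

inverse_iteration = function(A, mu, maxit = 1000, tol = 1e-10)
{
    n = nrow(A)
    fac = qr(A - mu * diag(n))           # factorize the shifted matrix once
    x = rnorm(n)
    x = x / sqrt(sum(x^2))
    theta_old = Inf
    for(k in 1:maxit)
    {
        y = qr.solve(fac, x)             # solve (A - mu*I) y = x
        theta = sum(x * y)               # estimate of theta1 = 1 / (lambda - mu)
        x = y / sqrt(sum(y^2))
        if(abs(theta - theta_old) < tol * abs(theta))
            break
        theta_old = theta
    }
    list(value = mu + 1 / theta, niter = k)
}

set.seed(123)
Q = qr.Q(qr(matrix(rnorm(100^2), 100)))
A = Q %*% diag(1:100) %*% t(Q)           # eigenvalues 1, 2, ..., 100
inverse_iteration(A, mu = 3.2)$value     # eigenvalue closest to 3.2, i.e., 3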

46 / 54

Implementation

  • The eigen() function in base R is a very stable and efficient function to compute all eigenvalues

  • Specify symmetric = TRUE if you know your matrix is symmetric (by default it will detect symmetry)

  • Specify only.values = TRUE if you do not need eigenvectors

47 / 54

Benchmark

library(dplyr)
set.seed(123)
n = 1000
M = matrix(rnorm(n^2), n)
A = M + t(M)
bench::mark(
    eigen(A),
    eigen(A, symmetric = FALSE),
    eigen(A, symmetric = TRUE),
    eigen(A, symmetric = TRUE, only.values = TRUE),
    min_iterations = 3, max_iterations = 10, check = FALSE
) %>% select(expression, min, median)
## # A tibble: 4 × 3
## expression min median
## <bch:expr> <bch:tm> <bch:tm>
## 1 eigen(A) 1.15s 1.17s
## 2 eigen(A, symmetric = FALSE) 4.13s 4.14s
## 3 eigen(A, symmetric = TRUE) 1.11s 1.13s
## 4 eigen(A, symmetric = TRUE, only.values = TRUE) 314.12ms 314.19ms
48 / 54

Implementation

  • To compute a subset of eigenvalues up to some selection rule, we have several options
  • The RSpectra package implements the implicitly restarted Lanczos method
  • The PRIMME package uses the Jacobi-Davidson algorithm
  • The irlba package is designed for singular value decomposition (next class), but can also be used to compute the largest eigenvalues
  • My personal favorite is the RSpectra package (this is a highly biased choice, of course)
49 / 54

Implementation

  • The code below computes three eigenvalues with largest absolute values
library(RSpectra)
e = eigs_sym(A, k = 3, which = "LM")
e$values
## [1] 88.83776 88.31500 -88.98945
head(e$vectors)
## [,1] [,2] [,3]
## [1,] 0.01506291 0.029602478 -0.015855700
## [2,] -0.05124211 0.016404205 -0.014749483
## [3,] -0.01542738 -0.002350839 0.005520141
## [4,] 0.02921002 -0.006653817 -0.003387408
## [5,] -0.02406133 0.039802584 0.028103112
## [6,] -0.01309102 -0.026105929 0.073813795
50 / 54

Implementation

  • If you want the largest three eigenvalues without taking absolute values, use which = "LA"
e = eigs_sym(A, k = 3, which = "LA")
e$values
## [1] 88.83776 88.31500 87.18646
  • If no eigenvectors are needed, add the following option
e = eigs_sym(A, k = 3, which = "LA", opts = list(retvec = FALSE))
e$vectors
## NULL
51 / 54

Benchmark

  • If only part of the eigenvalues are needed, eigs_sym() is much faster than eigen()
bench::mark(
    eigen(A, symmetric = TRUE),
    eigen(A, symmetric = TRUE, only.values = TRUE),
    eigs_sym(A, k = 3, which = "LM"),
    eigs_sym(A, k = 3, which = "LM", opts = list(retvec = FALSE)),
    min_iterations = 3, max_iterations = 10, check = FALSE
) %>% select(min, median)
## # A tibble: 4 × 2
## min median
## <bch:tm> <bch:tm>
## 1 1.12s 1.12s
## 2 310.66ms 310.66ms
## 3 149.93ms 150.31ms
## 4 149.43ms 150.85ms
  • The difference will be more evident on sparse matrices (next class)
52 / 54

Implementation

  • If one needs the smallest three eigenvalues (in absolute value)
e = eigs_sym(A, k = 3, which = "LM", sigma = 0)
e$values
## [1] 0.01589015 -0.14248784 -0.26307129
  • In general, if sigma is not NULL, then the selection rule is applied to 1/(λi − σ)

  • This only affects the selection rule

  • The returned eigenvalues are still for A

53 / 54

References

[1] Folkmar Bornemann (2018). Numerical Linear Algebra. Springer.

[2] Grégoire Allaire and Sidi Mahmoud Kaber (2008). Numerical linear algebra. Springer.

[3] Åke Björck (2015). Numerical methods in matrix computations. Springer.

[4] Danny C. Sorensen (1997). Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations. In Parallel Numerical Algorithms (pp. 119-165). Springer.

[5] Gerard L.G. Sleijpen and Henk A. Van der Vorst (1996). A Jacobi–Davidson iteration method for linear eigenvalue problems. SIAM Journal on Matrix Analysis and Applications, 17(2), 401-425.

54 / 54
