Computational Statistics

Lecture 1

Yixuan Qiu

2023-09-06


The Role of Computing


The Paradox

Modern statistics and data science demand novel and efficient computing algorithms

  • Big data breaks down many existing software packages

  • Solving complicated models is time-consuming

  • Hardware resources are not utilized in the most efficient way

  • High energy consumption but low performance


The Paradox

However, computing has long been an underrated field in statistics

  • Researchers tend to spend more time on the theoretical properties of models and methods

  • Practitioners love to directly use software packages and care less about the computing details

  • Computing itself is less likely to be the primary object of study

  • A carefully designed computing algorithm may be deemed merely a "trick" rather than a solid contribution


But doing it right is nontrivial


Test Yourself


Q1. Given two matrices $A_{n\times m}$ and $B_{m\times n}$, compute $\mathrm{tr}(AB)$.

Easy, isn't it?

sum(diag(A %*% B))

But this is REALLY bad!

set.seed(123)
A = matrix(rnorm(1000 * 2000), 2000)
B = matrix(rnorm(1000 * 2000), 1000)
system.time(sum(diag(A %*% B)))
## user system elapsed
## 2.15 0.00 2.24

"My solution" takes 0.012 seconds.

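The fast solution is left for class, but a standard identity consistent with the reported timing is $\mathrm{tr}(AB) = \sum_{i,j} A_{ij}B_{ji}$: the trace needs only elementwise products, never the full matrix product. A minimal sketch:

# tr(AB) via elementwise products; avoids forming the full n x n matrix A %*% B
sum(A * t(B))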

Q2. Let $f(x) = 0.3\,N(-1, 0.1) + 0.7\,N(1, 0.01)$. Write a function to evaluate $\log f(x)$. Test the result on $x = -15$ and $x = 15$.

logf = function(x)
{
    f = 0.3 * dnorm(x, -1, sqrt(0.1)) + 0.7 * dnorm(x, 1, 0.1)
    log(f)
}
logf(c(-15, 15))
## [1] -Inf -Inf

Any better solution?

The correct results are -980.972, -1280.972.

What if we need $\partial \log f(x) / \partial x$, e.g., in Hamiltonian Monte Carlo?

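The slides leave the fix open; a standard remedy (a sketch, not necessarily the lecture's solution) is to stay on the log scale and combine the two components with the log-sum-exp trick:

# Evaluate log f(x) stably: log(e^l1 + e^l2) = m + log(e^(l1-m) + e^(l2-m)),
# where m = max(l1, l2), so the exponentials never underflow to zero
logf_stable = function(x)
{
    l1 = log(0.3) + dnorm(x, -1, sqrt(0.1), log = TRUE)
    l2 = log(0.7) + dnorm(x, 1, 0.1, log = TRUE)
    m = pmax(l1, l2)
    m + log(exp(l1 - m) + exp(l2 - m))
}
logf_stable(c(-15, 15))
## [1] -980.972 -1280.972

Staying on the log scale also keeps $\partial \log f(x)/\partial x$ tractable, since it can be written in terms of the same stabilized component weights.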

Q3. Computing predicted values of linear regression: $\hat{y} = X(X'X)^{-1}X'Y$.

Literally translate the formula:

X %*% solve(t(X) %*% X) %*% t(X) %*% Y

NEVER, ever, do this!

set.seed(123)
X = matrix(rnorm(5000 * 500), 5000)
Y = rnorm(5000)
system.time(X %*% solve(t(X) %*% X) %*% t(X) %*% Y)
## user system elapsed
## 10.49 0.03 11.08

"My solution" takes 0.594 seconds.

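The key issue is the order of operations: evaluated left to right, the literal translation forms an $n \times n$ hat matrix. A sketch of the usual fix, grouping the products so every intermediate stays small and solving a linear system instead of inverting:

# Solve (X'X) beta = X'Y, then multiply by X once;
# crossprod(X) computes t(X) %*% X, crossprod(X, Y) computes t(X) %*% Y
beta = solve(crossprod(X), crossprod(X, Y))
yhat = X %*% beta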

Q4. Computing the operator norm of a matrix $A_{m\times n}$, $m < n$: $\|A\|_{\mathrm{op}} = \sqrt{\lambda_{\max}(A'A)}$.

By definition:

sqrt(eigen(t(A) %*% A)$values[1])

How fast is it on a matrix of size, say, 1000 x 2000?

set.seed(123)
A = matrix(rnorm(1000 * 2000), 1000)
system.time(sqrt(eigen(t(A) %*% A)$values[1]))
## user system elapsed
## 18.71 0.01 19.98

Any method to accelerate it?

"My solution" takes 0.351 seconds.

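One standard acceleration (a sketch, possibly not the lecture's exact solution): the nonzero eigenvalues of $A'A$ and $AA'$ coincide, so when $m < n$ we can work with the much smaller $m \times m$ matrix $AA'$, and tell eigen() the matrix is symmetric:

# lambda_max(A'A) = lambda_max(AA'), and AA' is only m x m;
# tcrossprod(A) computes A %*% t(A)
sqrt(eigen(tcrossprod(A), symmetric = TRUE, only.values = TRUE)$values[1])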

Do you still think computing is trivial?

  • Perhaps you have already used such "bad" code in your projects

  • But maybe you have not identified the issues in these examples yet

  • If this is the case, then this course is going to help you!


Why We Study Computing


Studying Computing

  • It boosts modern statistical research

  • Many problems are not yet well resolved and are good research topics

  • It helps us to make the best use of computational resources

  • It is fun!


Some (Loose) Definitions [1]

  • Statistical Computing: Computational methods that enable statistical methods

  • Computational Statistics: Statistical computing + statistical methods that are computationally intensive


Scope

Statistical computing is a multidisciplinary field

  • Statistics
  • Computer science
  • Numerical analysis
  • Optimization
  • Databases
  • Programming
  • Graphics
  • Software engineering
  • Reproducible research
  • ...

Scope

Computational statistics builds on top of mathematical statistics, statistical computing, and applied statistics

  • Everything that appears in the previous slide
  • Monte Carlo simulation
  • Bootstrap/resampling techniques
  • EM algorithm
  • Density estimation
  • Computational optimal transport
  • ...

Scope

  • This course will primarily focus on fundamental computing techniques

  • Will also discuss some advanced topics in modern computational statistics


Fundamental Problems

  • Numerical Linear Algebra

    • Computations on matrices and vectors
  • Optimization

    • Minimizing objective functions
  • Simulation/Sampling

    • Generating random numbers from statistical distributions

Roadmap


Principle

  • We will introduce numerical tools that are useful in statistical computing

  • We may not elaborate on the details of every algorithm, but we will introduce its motivation and purpose

  • You do not need to be an expert on each method

  • But you need to know their existence, so that you can quickly recall them when you encounter a related problem


Course Projects

  • Some important methods and algorithms will be left as homework or course projects

    • Understand the details well

    • Go through and verify the theoretical properties

    • Implement the algorithm correctly and efficiently


Topics on Numerical Linear Algebra

  • Triangular and tridiagonal systems

  • LU decomposition

  • Cholesky decomposition

  • Conjugate gradient method

  • $LDL^T$ decomposition

  • QR decomposition

  • Eigenvalue/Singular value decomposition

  • Basic sparse matrix operations


Topics on Optimization

  • (Sub)Gradient descent and projected (sub)gradient descent

  • Newton's method

  • Proximal gradient descent

  • L-BFGS and L-BFGS-B

  • Douglas-Rachford splitting

  • ADMM


Topics on Simulation and Sampling

  • Univariate random number generation

  • Rejection sampling

  • Importance sampling

  • Markov chain Monte Carlo

  • Measure transport


Miscellaneous

  • Numerical stability

  • Software

  • Calling C++ code within R and Python

  • Basic parallel computing methods


Goals of Statistical Computing

  • Be Accurate

  • Be Stable

  • Be Fast


Every Story Starts with Regression


Linear Regression

In the classical setting, we are concerned with the regression model

$Y_{n\times 1} = X_{n\times p}\,\beta_{p\times 1} + \varepsilon_{n\times 1}.$

The ordinary least squares (OLS) solution is well-established:

$\hat\beta = (X'X)^{-1}X'Y.$

Is computing still a problem?


Problem 1: Do OLS right

  • What is the proper method to compute $\hat\beta$? (Recall the problem in "Test Yourself"; a sketch follows below)

  • What if $X$ is not full rank, and how do we detect it?

  • What if $n$ is extremely large?

  • What if $X$ is sparse?

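For the first question, one standard and stable answer (a sketch previewing the numerical linear algebra unit, assuming $X$ has full column rank) is to use the QR decomposition rather than inverting $X'X$, whose condition number is the square of that of $X$:

# OLS via QR: X = QR with Q orthogonal, then solve R beta = Q'Y
qrx = qr(X)
beta = qr.coef(qrx, Y)      # regression coefficients
yhat = qr.fitted(qrx, Y)    # fitted values, without forming (X'X)^{-1}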

Problem 2: Ridge regression

If $n < p$, we can fit a ridge regression model, and the coefficient estimate also has a closed form:

$\hat\beta_\lambda = (X'X + \lambda I)^{-1}X'Y, \quad \lambda > 0.$

  • What is the proper method to compute $\hat\beta_\lambda$?

  • What if $p$ is extremely large? (A sketch follows below)

  • What if $X$ is sparse?

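For the large-$p$ question, a classical identity (a sketch, not necessarily the answer developed later) is $(X'X + \lambda I_p)^{-1}X'Y = X'(XX' + \lambda I_n)^{-1}Y$, which replaces a $p \times p$ system with an $n \times n$ one when $n < p$:

# Ridge regression via the Woodbury-type identity above;
# solves an n x n linear system instead of a p x p one
ridge_small_n = function(X, Y, lambda)
{
    n = nrow(X)
    crossprod(X, solve(tcrossprod(X) + lambda * diag(n), Y))
}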

Problem 3: Lasso

Alternatively, we can use a lasso model, which solves the optimization problem

$\hat\beta_{\mathrm{lasso}} = \arg\min_\beta\ \frac{1}{2n}\|Y - X\beta\|^2 + \lambda\|\beta\|_1.$

$\hat\beta_{\mathrm{lasso}}$ no longer has a closed form.

  • What is the proper method to compute $\hat\beta_{\mathrm{lasso}}$? (A sketch follows below)

  • What if $X$ is sparse?

  • What about other general penalty functions?

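One common answer among several (coordinate descent is another) is proximal gradient descent: the proximal operator of the $\ell_1$ penalty is soft-thresholding. A minimal sketch, assuming a step size $s \le n / \lambda_{\max}(X'X)$:

# Proximal gradient (ISTA) sketch for the lasso objective above
soft_threshold = function(z, t) sign(z) * pmax(abs(z) - t, 0)
lasso_ista = function(X, Y, lambda, s, niter = 500)
{
    beta = rep(0, ncol(X))
    for (i in 1:niter)
    {
        grad = crossprod(X, X %*% beta - Y) / nrow(X)  # gradient of the smooth part
        beta = soft_threshold(beta - s * grad, s * lambda)
    }
    beta
}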

Problem 4: Nonnegative Least Squares

In the regression setting, if we force $\beta$ to be nonnegative, then we need to solve

$\hat\beta_{\mathrm{nls}} = \arg\min_\beta\ \|Y - X\beta\|^2 \quad \text{subject to } \beta \ge 0.$

$\hat\beta_{\mathrm{nls}}$ does not have a closed form either.

  • What is the proper method to compute $\hat\beta_{\mathrm{nls}}$? (A sketch follows below)

  • What if $X$ is sparse?

  • What about other general constraints?

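One natural answer, previewing the projected gradient method in the optimization topics: the feasible set is the nonnegative orthant, whose projection is simply pmax(., 0). A sketch, assuming a step size $s \le 1 / (2\lambda_{\max}(X'X))$:

# Projected gradient descent sketch for nonnegative least squares
nnls_pg = function(X, Y, s, niter = 500)
{
    beta = rep(0, ncol(X))
    for (i in 1:niter)
    {
        grad = 2 * crossprod(X, X %*% beta - Y)  # gradient of ||Y - X beta||^2
        beta = pmax(beta - s * grad, 0)          # project onto {beta >= 0}
    }
    beta
}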

Problem 5: Bayesian Regression

We assume $(\beta, \sigma^2)$ has a joint prior distribution $\pi(\beta, \sigma^2)$, and given $(\beta, \sigma^2)$, $Y \mid \{\beta, \sigma^2\} \sim N(X\beta, \sigma^2 I)$.

  • How to approximate expectations $E[f(\beta, \sigma)]$ with respect to the posterior distribution $p(\beta, \sigma^2 \mid Y, X)$, for some predetermined function $f$? (An MCMC sketch follows below)

  • How to obtain an independent sample from $p(\beta, \sigma^2 \mid Y, X)$?

  • How to get unbiased estimators for $E[f(\beta, \sigma)]$?

  • What if $n$ is extremely large?

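As a concrete preview, here is a Gibbs sampler sketch under an assumed noninformative prior $\pi(\beta, \sigma^2) \propto 1/\sigma^2$ (a standard textbook choice, not necessarily the prior used later in the course). Posterior expectations are then approximated by averaging $f$ over the draws.

# Gibbs sampler for linear regression under the assumed flat prior:
#   beta | sigma^2, Y ~ N(bhat, sigma^2 (X'X)^{-1})
#   sigma^2 | beta, Y ~ Inverse-Gamma(n/2, ||Y - X beta||^2 / 2)
gibbs_lm = function(X, Y, niter = 1000)
{
    n = nrow(X); p = ncol(X)
    XtX = crossprod(X)
    bhat = solve(XtX, crossprod(X, Y))   # OLS estimate, the conditional mean
    R = chol(solve(XtX))                 # R'R = (X'X)^{-1}
    sigma2 = 1
    draws = matrix(0, niter, p + 1)
    for (i in 1:niter)
    {
        beta = bhat + sqrt(sigma2) * crossprod(R, rnorm(p))
        resid = Y - X %*% beta
        sigma2 = 1 / rgamma(1, shape = n / 2, rate = sum(resid^2) / 2)
        draws[i, ] = c(beta, sigma2)
    }
    draws
}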

Connections

  • Problems 1 and 2 mainly involve numerical linear algebra

  • Problems 3 and 4 are optimization problems

  • Problem 5 is concerned with simulation and sampling techniques

  • These areas also interact with each other


References

[1] James E. Gentle, Wolfgang Karl Härdle, and Yuichi Mori (2012). Handbook of Computational Statistics. Springer.

[2] Dennis S. Bernstein (2009). Matrix Mathematics: Theory, Facts, and Formulas (Second Edition). Princeton University Press.
