Computational Statistics

Lecture 1

Yixuan Qiu

2023-09-06


The Role of Computing


The Paradox

Modern statistics and data science demand novel and efficient computing algorithms

  • Big data breaks down many existing software packages

  • Solving complicated models is time-consuming

  • Hardware resources are not utilized in the most efficient way

  • High energy consumption but low performance


The Paradox

However, computing has long been an underrated field in statistics

  • Researchers tend to spend more time on the theoretical properties of models and methods

  • Practitioners love to directly use software packages and care less about the computing details

  • Computing itself is less likely to be the primary object of study

  • A carefully designed computing algorithm may be deemed merely a "trick" rather than a solid contribution


But doing it right is nontrivial


Test Yourself


Q1. Given two matrices $A_{n\times m}$ and $B_{m\times n}$, compute $\mathrm{tr}(AB)$.

Easy, isn't it?

sum(diag(A %*% B))

But this is REALLY bad!

set.seed(123)
A = matrix(rnorm(1000 * 2000), 2000)
B = matrix(rnorm(1000 * 2000), 1000)
system.time(sum(diag(A %*% B)))
## user system elapsed
## 2.15 0.00 2.24

"My solution" takes 0.012 seconds.

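The fast solution is left for class, but a standard identity consistent with the reported timing is $\mathrm{tr}(AB) = \sum_{i,j} A_{ij}B_{ji}$: the trace needs only elementwise products, never the full matrix product. A minimal sketch:

# tr(AB) via elementwise products; avoids forming the full n x n matrix A %*% B
sum(A * t(B))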

Q2. Let $f(x) = 0.3\,N(-1, 0.1) + 0.7\,N(1, 0.01)$. Write a function to evaluate $\log f(x)$. Test the result on $x = -15$ and $x = 15$.

logf = function(x)
{
    f = 0.3 * dnorm(x, -1, sqrt(0.1)) + 0.7 * dnorm(x, 1, 0.1)
    log(f)
}
logf(c(-15, 15))
## [1] -Inf -Inf

Any better solution?

The correct results are -980.972, -1280.972.

What if we need $\partial \log f(x) / \partial x$, e.g., in Hamiltonian Monte Carlo?

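The slides leave the fix open; a standard remedy (a sketch, not necessarily the lecture's solution) is to stay on the log scale and combine the two components with the log-sum-exp trick:

# Evaluate log f(x) stably: log(e^l1 + e^l2) = m + log(e^(l1-m) + e^(l2-m)),
# where m = max(l1, l2), so the exponentials never underflow to zero
logf_stable = function(x)
{
    l1 = log(0.3) + dnorm(x, -1, sqrt(0.1), log = TRUE)
    l2 = log(0.7) + dnorm(x, 1, 0.1, log = TRUE)
    m = pmax(l1, l2)
    m + log(exp(l1 - m) + exp(l2 - m))
}
logf_stable(c(-15, 15))
## [1] -980.972 -1280.972

Staying on the log scale also keeps $\partial \log f(x)/\partial x$ tractable, since it can be written in terms of the same stabilized component weights.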

Q3. Computing predicted values of linear regression: $\hat{y} = X(X'X)^{-1}X'Y$.

Literally translate the formula:

X %*% solve(t(X) %*% X) %*% t(X) %*% Y

NEVER, ever, do this!

set.seed(123)
X = matrix(rnorm(5000 * 500), 5000)
Y = rnorm(5000)
system.time(X %*% solve(t(X) %*% X) %*% t(X) %*% Y)
## user system elapsed
## 10.49 0.03 11.08

"My solution" takes 0.594 seconds.

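The key issue is the order of operations: evaluated left to right, the literal translation forms an $n \times n$ hat matrix. A sketch of the usual fix, grouping the products so every intermediate stays small and solving a linear system instead of inverting:

# Solve (X'X) beta = X'Y, then multiply by X once;
# crossprod(X) computes t(X) %*% X, crossprod(X, Y) computes t(X) %*% Y
beta = solve(crossprod(X), crossprod(X, Y))
yhat = X %*% beta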

Q4. Computing the operator norm of a matrix $A_{m\times n}$, $m < n$: $\|A\|_{\mathrm{op}} = \sqrt{\lambda_{\max}(A'A)}$.

By definition:

sqrt(eigen(t(A) %*% A)$values[1])

How fast is it on a matrix of size, say, 1000 x 2000?

set.seed(123)
A = matrix(rnorm(1000 * 2000), 1000)
system.time(sqrt(eigen(t(A) %*% A)$values[1]))
## user system elapsed
## 18.71 0.01 19.98

Any method to accelerate it?

"My solution" takes 0.351 seconds.

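One standard acceleration (a sketch, possibly not the lecture's exact solution): the nonzero eigenvalues of $A'A$ and $AA'$ coincide, so when $m < n$ we can work with the much smaller $m \times m$ matrix $AA'$, and tell eigen() the matrix is symmetric:

# lambda_max(A'A) = lambda_max(AA'), and AA' is only m x m;
# tcrossprod(A) computes A %*% t(A)
sqrt(eigen(tcrossprod(A), symmetric = TRUE, only.values = TRUE)$values[1])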

Do you still think computing is trivial?

  • Perhaps you have already used such "bad" code in your projects

  • But maybe you have not identified the issues in these examples yet

  • If this is the case, then this course is going to help you!


Why We Study Computing


Studying Computing

  • It boosts modern statistical research

  • Many problems are not yet well resolved and are good research topics

  • It helps us to make the best use of computational resources

  • It is fun!


Some (Loose) Definitions [1]

  • Statistical Computing: Computational methods that enable statistical methods

  • Computational Statistics: Statistical computing + statistical methods that are computationally intensive


Scope

Statistical computing is a multidisciplinary field

  • Statistics
  • Computer science
  • Numerical analysis
  • Optimization
  • Databases
  • Programming
  • Graphics
  • Software engineering
  • Reproducible research
  • ...

Scope

Computational statistics builds on top of mathematical statistics, statistical computing, and applied statistics

  • Everything that appears in the previous slide
  • Monte Carlo simulation
  • Bootstrap/resampling techniques
  • EM algorithm
  • Density estimation
  • Computational optimal transport
  • ...

Scope

  • This course will primarily focus on fundamental computing techniques

  • Will also discuss some advanced topics in modern computational statistics


Fundamental Problems

  • Numerical Linear Algebra

    • Computations on matrices and vectors
  • Optimization

    • Minimizing objective functions
  • Simulation/Sampling

    • Generating random numbers from statistical distributions

Roadmap


Principle

  • We will introduce numerical tools that are useful in statistical computing

  • We may not elaborate on the details of every algorithm, but we will introduce its motivation and purpose

  • You do not need to be an expert on each method

  • But you need to know their existence, so that you can quickly recall them when you encounter a related problem


Course Projects

  • Some important methods and algorithms will be left as homework or course projects

    • Understand the details well

    • Go through and verify the theoretical properties

    • Implement the algorithm correctly and efficiently


Topics on Numerical Linear Algebra

  • Triangular and tridiagonal systems

  • LU decomposition

  • Cholesky decomposition

  • Conjugate gradient method

  • $LDL^T$ decomposition

  • QR decomposition

  • Eigenvalue/Singular value decomposition

  • Basic sparse matrix operations


Topics on Optimization

  • (Sub)Gradient descent and projected (sub)gradient descent

  • Newton's method

  • Proximal gradient descent

  • L-BFGS and L-BFGS-B

  • Douglas-Rachford splitting

  • ADMM


Topics on Simulation and Sampling

  • Univariate random number generation

  • Rejection sampling

  • Importance sampling

  • Markov chain Monte Carlo

  • Measure transport


Miscellaneous

  • Numerical stability

  • Software

  • Calling C++ code within R and Python

  • Basic parallel computing methods


Goals of Statistical Computing

  • Be Accurate

  • Be Stable

  • Be Fast


Every Story Starts with Regression


Linear Regression

In the classical setting, we are concerned with the regression model

$Y_{n\times 1} = X_{n\times p}\,\beta_{p\times 1} + \varepsilon_{n\times 1}.$

The ordinary least squares (OLS) solution is well-established:

$\hat\beta = (X'X)^{-1}X'Y.$

Is computing still a problem?


Problem 1: Do OLS right

  • What is the proper method to compute $\hat\beta$? (Recall the problem in "Test Yourself"; a sketch follows below)

  • What if $X$ is not full rank, and how do we detect it?

  • What if $n$ is extremely large?

  • What if $X$ is sparse?

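For the first question, one standard and stable answer (a sketch previewing the numerical linear algebra unit, assuming $X$ has full column rank) is to use the QR decomposition rather than inverting $X'X$, whose condition number is the square of that of $X$:

# OLS via QR: X = QR with Q orthogonal, then solve R beta = Q'Y
qrx = qr(X)
beta = qr.coef(qrx, Y)      # regression coefficients
yhat = qr.fitted(qrx, Y)    # fitted values, without forming (X'X)^{-1}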

Problem 2: Ridge regression

If $n < p$, we can fit a ridge regression model, and the coefficient estimate also has a closed form:

$\hat\beta_\lambda = (X'X + \lambda I)^{-1}X'Y, \quad \lambda > 0.$

  • What is the proper method to compute $\hat\beta_\lambda$?

  • What if $p$ is extremely large? (A sketch follows below)

  • What if $X$ is sparse?

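For the large-$p$ question, a classical identity (a sketch, not necessarily the answer developed later) is $(X'X + \lambda I_p)^{-1}X'Y = X'(XX' + \lambda I_n)^{-1}Y$, which replaces a $p \times p$ system with an $n \times n$ one when $n < p$:

# Ridge regression via the Woodbury-type identity above;
# solves an n x n linear system instead of a p x p one
ridge_small_n = function(X, Y, lambda)
{
    n = nrow(X)
    crossprod(X, solve(tcrossprod(X) + lambda * diag(n), Y))
}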

Problem 3: Lasso

Alternatively, we can use a lasso model, which solves the optimization problem

$\hat\beta_{\mathrm{lasso}} = \arg\min_\beta\ \frac{1}{2n}\|Y - X\beta\|^2 + \lambda\|\beta\|_1.$

$\hat\beta_{\mathrm{lasso}}$ no longer has a closed form.

  • What is the proper method to compute $\hat\beta_{\mathrm{lasso}}$? (A sketch follows below)

  • What if $X$ is sparse?

  • What about other general penalty functions?

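One common answer among several (coordinate descent is another) is proximal gradient descent: the proximal operator of the $\ell_1$ penalty is soft-thresholding. A minimal sketch, assuming a step size $s \le n / \lambda_{\max}(X'X)$:

# Proximal gradient (ISTA) sketch for the lasso objective above
soft_threshold = function(z, t) sign(z) * pmax(abs(z) - t, 0)
lasso_ista = function(X, Y, lambda, s, niter = 500)
{
    beta = rep(0, ncol(X))
    for (i in 1:niter)
    {
        grad = crossprod(X, X %*% beta - Y) / nrow(X)  # gradient of the smooth part
        beta = soft_threshold(beta - s * grad, s * lambda)
    }
    beta
}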

Problem 4: Nonnegative Least Squares

In the regression setting, if we force $\beta$ to be nonnegative, then we need to solve

$\hat\beta_{\mathrm{nls}} = \arg\min_\beta\ \|Y - X\beta\|^2 \quad \text{subject to } \beta \ge 0.$

$\hat\beta_{\mathrm{nls}}$ does not have a closed form either.

  • What is the proper method to compute $\hat\beta_{\mathrm{nls}}$? (A sketch follows below)

  • What if $X$ is sparse?

  • What about other general constraints?

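One natural answer, previewing the projected gradient method in the optimization topics: the feasible set is the nonnegative orthant, whose projection is simply pmax(., 0). A sketch, assuming a step size $s \le 1 / (2\lambda_{\max}(X'X))$:

# Projected gradient descent sketch for nonnegative least squares
nnls_pg = function(X, Y, s, niter = 500)
{
    beta = rep(0, ncol(X))
    for (i in 1:niter)
    {
        grad = 2 * crossprod(X, X %*% beta - Y)  # gradient of ||Y - X beta||^2
        beta = pmax(beta - s * grad, 0)          # project onto {beta >= 0}
    }
    beta
}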

Problem 5: Bayesian Regression

We assume $(\beta, \sigma^2)$ has a joint prior distribution $\pi(\beta, \sigma^2)$, and given $(\beta, \sigma^2)$, $Y \mid \{\beta, \sigma^2\} \sim N(X\beta, \sigma^2 I)$.

  • How to approximate expectations $E[f(\beta, \sigma)]$ with respect to the posterior distribution $p(\beta, \sigma^2 \mid Y, X)$, for some predetermined function $f$? (An MCMC sketch follows below)

  • How to obtain an independent sample from $p(\beta, \sigma^2 \mid Y, X)$?

  • How to get unbiased estimators for $E[f(\beta, \sigma)]$?

  • What if $n$ is extremely large?

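As a concrete preview, here is a Gibbs sampler sketch under an assumed noninformative prior $\pi(\beta, \sigma^2) \propto 1/\sigma^2$ (a standard textbook choice, not necessarily the prior used later in the course). Posterior expectations are then approximated by averaging $f$ over the draws.

# Gibbs sampler for linear regression under the assumed flat prior:
#   beta | sigma^2, Y ~ N(bhat, sigma^2 (X'X)^{-1})
#   sigma^2 | beta, Y ~ Inverse-Gamma(n/2, ||Y - X beta||^2 / 2)
gibbs_lm = function(X, Y, niter = 1000)
{
    n = nrow(X); p = ncol(X)
    XtX = crossprod(X)
    bhat = solve(XtX, crossprod(X, Y))   # OLS estimate, the conditional mean
    R = chol(solve(XtX))                 # R'R = (X'X)^{-1}
    sigma2 = 1
    draws = matrix(0, niter, p + 1)
    for (i in 1:niter)
    {
        beta = bhat + sqrt(sigma2) * crossprod(R, rnorm(p))
        resid = Y - X %*% beta
        sigma2 = 1 / rgamma(1, shape = n / 2, rate = sum(resid^2) / 2)
        draws[i, ] = c(beta, sigma2)
    }
    draws
}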

Connections

  • Problems 1 and 2 mainly involve numerical linear algebra

  • Problems 3 and 4 are optimization problems

  • Problem 5 is concerned with simulation and sampling techniques

  • These areas also interact with each other


References

[1] James E. Gentle, Wolfgang Karl Härdle, and Yuichi Mori (2012). Handbook of Computational Statistics. Springer.

[2] Dennis S. Bernstein (2009). Matrix Mathematics: Theory, Facts, and Formulas (Second Edition). Princeton University Press.
