Modern statistics and data science demand novel and efficient computing algorithms
Big data breaks many existing software packages
Solving complicated models is time-consuming
Hardware resources are not utilized in the most efficient way
High energy consumption but low performance
However, computing has long been an underrated field in statistics
Researchers tend to spend more time on the theoretical properties of models and methods
Practitioners love to directly use software packages and care less about the computing details
Computing itself is less likely to be the primary object of study
A carefully designed computing algorithm may be deemed merely a "trick" rather than a solid contribution
Q1. Given two matrices $A_{n \times m}$ and $B_{m \times n}$, compute $\mathrm{tr}(AB)$.
Easy, isn't it?
sum(diag(A %*% B))
But this is REALLY bad!
set.seed(123)
A = matrix(rnorm(1000 * 2000), 2000)
B = matrix(rnorm(1000 * 2000), 1000)
system.time(sum(diag(A %*% B)))
##    user  system elapsed
##    2.15    0.00    2.24
"My solution" takes 0.012 seconds.
Q2. Let $f(x)$ be the density of the mixture $0.3 \cdot N(-1, 0.1) + 0.7 \cdot N(1, 0.01)$. Write a function to evaluate $\log f(x)$. Test the result on $x = -15$ and $x = 15$.
logf = function(x) {
    f = 0.3 * dnorm(x, -1, sqrt(0.1)) + 0.7 * dnorm(x, 1, 0.1)
    log(f)
}
logf(c(-15, 15))
## [1] -Inf -Inf
Any better solution?
The correct results are -980.972, -1280.972.
What if we need ∂logf(x)/∂x, e.g., in Hamiltonian Monte Carlo?
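The intended solution is not shown; a standard numerically stable approach (a sketch) evaluates each component on the log scale with dnorm(..., log = TRUE) and combines them with the log-sum-exp trick:

# log(exp(a) + exp(b)) computed without underflow by factoring out the max
log_sum_exp = function(a, b) {
    m = pmax(a, b)
    m + log(exp(a - m) + exp(b - m))
}
logf_stable = function(x) {
    l1 = log(0.3) + dnorm(x, -1, sqrt(0.1), log = TRUE)
    l2 = log(0.7) + dnorm(x, 1, 0.1, log = TRUE)
    log_sum_exp(l1, l2)
}
logf_stable(c(-15, 15))   # -980.972 -1280.972, matching the values above

The same log-scale representation also yields a stable gradient, since the component weights exp(l1 - logf) and exp(l2 - logf) always lie in [0, 1].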
Q3. Computing the predicted values of linear regression, $\hat{y} = X(X'X)^{-1}X'Y$.
Literally translating the formula:
X %*% solve(t(X) %*% X) %*% t(X) %*% Y
NEVER, ever, do this!
set.seed(123)
X = matrix(rnorm(5000 * 500), 5000)
Y = rnorm(5000)
system.time(X %*% solve(t(X) %*% X) %*% t(X) %*% Y)
##    user  system elapsed
##   10.49    0.03   11.08
"My solution" takes 0.594 seconds.
Q4. Computing the operator norm of a matrix $A_{m \times n}$, $m < n$: $\|A\|_{\mathrm{op}} = \sqrt{\lambda_{\max}(A'A)}$.
By definition:
sqrt(eigen(t(A) %*% A)$values[1])
How fast is it on a matrix of size, say, 1000 x 2000?
set.seed(123)
A = matrix(rnorm(1000 * 2000), 1000)
system.time(sqrt(eigen(t(A) %*% A)$values[1]))
##    user  system elapsed
##   18.71    0.01   19.98
Any method to accelerate it?
"My solution" takes 0.351 seconds.
Perhaps you have already used such "bad" code in your projects
But maybe you have not identified the issues in these examples yet
If this is the case, then this course is going to help you!
It boosts modern statistical research
Many problems are not yet well resolved and are good research topics
It helps us to make the best use of computational resources
It is fun!
Statistical Computing: Computational methods that enable statistical methods
Computational Statistics: Statistical computing + statistical methods that are computationally intensive
Statistical computing is a multidisciplinary field
Computational statistics builds on top of mathematical statistics, statistical computing, and applied statistics
This course will primarily focus on fundamental computing techniques
Will also discuss some advanced topics in modern computational statistics
Numerical Linear Algebra
Optimization
Simulation/Sampling
We will introduce numerical tools that are useful in statistical computing
We may not elaborate on the details of every algorithm, but we will introduce its motivation and purpose
You do not need to be an expert on each method
But you do need to know that they exist, so that you can quickly recall them when you encounter a related problem
Some important methods and algorithms will be left as homework or course projects
Understand the details well
Go through and verify the theoretical properties
Implement the algorithm correctly and efficiently
Triangular and tridiagonal systems
LU decomposition
Cholesky decomposition
Conjugate gradient method
LDLT decomposition
QR decomposition
Eigenvalue/Singular value decomposition
Basic sparse matrix operations
(Sub)Gradient descent and projected (sub)gradient descent
Newton's method
Proximal gradient descent
L-BFGS and L-BFGS-B
Douglas-Rachford splitting
ADMM
Univariate random number generation
Rejection sampling
Importance sampling
Markov chain Monte Carlo
Measure transport
Numerical stability
Software
Calling C++ code within R and Python
Basic parallel computing methods
Be Accurate
Be Stable
Be Fast
In the classical setting, we are concerned with the regression model
$Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \varepsilon_{n \times 1}.$
The ordinary least squares (OLS) solution is well-established:
$\hat{\beta} = (X'X)^{-1} X'Y.$
Is computing still a problem?
What is the proper method to compute $\hat{\beta}$? (Recall the problem in "Test Yourself"; a sketch follows this list.)
What if X is not full rank, and how do we detect it?
What if n is extremely large?
What if X is sparse?
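As a hypothetical illustration of the first two questions, the QR decomposition in base R solves the least-squares problem without forming $X'X$ and also reports the numerical rank:

# A minimal sketch: least squares via the QR decomposition
qr_obj = qr(X)
betahat = qr.coef(qr_obj, Y)   # solves R %*% beta = t(Q) %*% Y
qr_obj$rank                    # numerical rank; < ncol(X) signals rank deficiency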
If n < p, we can fit a ridge regression model, and the coefficient estimate also has a closed form:
$\hat{\beta}_\lambda = (X'X + \lambda I)^{-1} X'Y, \quad \lambda > 0.$
What is the proper method to compute $\hat{\beta}_\lambda$? (A sketch for the large-p case follows this list.)
What if p is extremely large?
What if X is sparse?
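One answer to the large-p question (a sketch, not necessarily the course's intended one) uses the push-through identity $(X'X + \lambda I_p)^{-1} X'Y = X'(XX' + \lambda I_n)^{-1} Y$, which replaces a p x p solve with an n x n solve:

# Ridge coefficients when p >> n: solve an n x n system instead
ridge_small_n = function(X, Y, lambda) {
    n = nrow(X)
    crossprod(X, solve(tcrossprod(X) + lambda * diag(n), Y))
}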
Alternatively, we can use a lasso model, which solves the optimization problem
$\hat{\beta}_{\mathrm{lasso}} = \arg\min_\beta \ \frac{1}{2n} \|Y - X\beta\|^2 + \lambda \|\beta\|_1.$
$\hat{\beta}_{\mathrm{lasso}}$ no longer has a closed form.
What is the proper method to compute $\hat{\beta}_{\mathrm{lasso}}$? (A sketch follows this list.)
What if X is sparse?
What about other general penalty functions?
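To make this concrete, here is a minimal proximal gradient (ISTA) sketch for the lasso objective above; proximal gradient descent appears later in the course topics, and production solvers (e.g., coordinate descent in glmnet) are typically much faster:

# Soft-thresholding is the proximal operator of lambda * ||.||_1
soft_threshold = function(z, t) sign(z) * pmax(abs(z) - t, 0)
lasso_ista = function(X, Y, lambda, iters = 500) {
    n = nrow(X)
    L = svd(X, nu = 0, nv = 0)$d[1]^2 / n     # Lipschitz constant of the gradient
    beta = numeric(ncol(X))
    for (i in seq_len(iters)) {
        grad = crossprod(X, X %*% beta - Y) / n   # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lambda / L)
    }
    beta
}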
In the regression setting, if we force β to be nonnegative, then we need to solve
$\hat{\beta}_{\mathrm{nls}} = \arg\min_\beta \ \|Y - X\beta\|^2 \quad \text{subject to} \quad \beta \ge 0.$
$\hat{\beta}_{\mathrm{nls}}$ does not have a closed form either.
What is the proper method to compute $\hat{\beta}_{\mathrm{nls}}$? (A sketch follows this list.)
What if X is sparse?
What about other general constraints?
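As a sketch of one approach (projected gradient descent, listed in the course topics): take a gradient step on the least-squares objective, then project onto the nonnegative orthant by clipping at zero:

nnls_pgd = function(X, Y, iters = 500) {
    L = 2 * svd(X, nu = 0, nv = 0)$d[1]^2   # Lipschitz constant of the gradient
    beta = numeric(ncol(X))
    for (i in seq_len(iters)) {
        grad = 2 * crossprod(X, X %*% beta - Y)
        beta = pmax(beta - grad / L, 0)     # projection onto beta >= 0
    }
    beta
}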
We assume $(\beta, \sigma^2)$ has a joint prior distribution $\pi(\beta, \sigma^2)$, and given $(\beta, \sigma^2)$, $Y \mid \{\beta, \sigma^2\} \sim N(X\beta, \sigma^2 I)$.
How to approximate expectations $E[f(\beta, \sigma)]$ with respect to the posterior distribution $p(\beta, \sigma^2 \mid Y, X)$, for some predetermined function f? (A sketch follows this list.)
How to obtain an independent sample from $p(\beta, \sigma^2 \mid Y, X)$?
How to get unbiased estimators for E[f(β,σ)]?
What if n is extremely large?
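As a minimal sketch of the first question: given draws from the posterior, however obtained (e.g., by MCMC), the expectation is approximated by a sample average. The structure of draws below is hypothetical, not from the course:

# Hypothetical input: draws$beta is an S x p matrix of posterior samples
# and draws$sigma a length-S vector; average f over the draws
approx_posterior_mean = function(draws, f) {
    S = length(draws$sigma)
    mean(sapply(seq_len(S), function(s) f(draws$beta[s, ], draws$sigma[s])))
}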
Problems 1 and 2 mainly involve numerical linear algebra
Problems 3 and 4 are optimization problems
Problem 5 is concerned with simulation and sampling techniques
These areas also interact with each other