Yixuan's Homepage
https://statr.me/
Installing Solus from QEMU
https://statr.me/2020/12/install-solus-from-qemu/
Tue, 29 Dec 2020 00:00:00 +0000
<blockquote>
<p>This article mainly records my own experience installing the Solus operating system on
a new machine, during which I ran into trouble with the regular method. It may not be very helpful
for general users, but if you have had a similar experience, then the method introduced here is one
possible solution.</p>
</blockquote>
<p>I typically install three operating systems (OS) side by side on my machine for daily use: Windows,
<a href="https://manjaro.org/">Manjaro Linux</a>, and <a href="https://getsol.us/">Solus</a>. Solus is a very young OS,
but with a strong community for development and maintenance. It has a modern design, and is well
tuned and optimized, making it very suitable for programming and development. It is especially useful
if you want to run GPU-based programs, as it has a nice integration with video card drivers.</p>
<p>Recently I got a new desktop equipped with the RTX 3070 video card, on which I intend to run some deep
learning code. I had a pleasant experience working with Solus on my laptop, so I also decided to install
Solus on the new machine. However, <a href="https://getsol.us/articles/installation/preparing-to-install/en/">the regular method</a>
introduced in the official guide did not succeed, possibly because the video card is relatively new, and the
live CD could not even boot.</p>
<p>After several rounds of searching the web, I still couldn’t find a solution, and my last resort was to
install the OS via a virtual machine. By this I do not mean installing Solus into a virtual machine.
Instead, what I want is to boot the live CD within a virtual machine, and then install the OS to the
real machine.</p>
<p>There are many choices for virtual machine software, and finally I used <a href="https://www.qemu.org/">QEMU</a>
since it is very lightweight and does not require heavy configuration.
The following command immediately opens a virtual machine that runs the <a href="https://mirrors.rit.edu/solus/images/4.1/Solus-4.1-Budgie.iso">Solus live CD</a>:</p>
<pre><code class="language-bash">qemu-system-x86_64 -boot d -cdrom Solus-4.1-Budgie.iso -m 4096
</code></pre>
<div align="center">
<img src="https://upload.yixuan.blog/en/2020/12/qemu.png" alt="QEMU" />
</div>
<p>But there are two problems here. First, as explained in
<a href="https://getsol.us/articles/troubleshooting/installation-issues/en/">this article</a>,
the live CD will attempt to install Solus in the same mode in which it was booted, which means that
if the live CD was not booted in <a href="https://en.wikipedia.org/wiki/EFI_system_partition">EFI mode</a>,
then it would not provide the EFI option for installation to the real machine. QEMU by default
does not provide the EFI boot option, so the virtual machine above was booted in legacy mode,
which can be verified by running the <code>ls /sys/firmware/efi</code> command.</p>
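<p>For reference, the check can be wrapped in a tiny script (a sketch; the message strings are my own):</p>
<pre><code class="language-bash"># Inside the live session: this directory exists only under an EFI boot
if [ -d /sys/firmware/efi ]; then
    echo "Booted in EFI mode"
else
    echo "Booted in legacy (BIOS) mode"
fi
</code></pre>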
<p>Fortunately, there is a method to add EFI support to QEMU. Following the instructions
<a href="https://unix.stackexchange.com/a/57221">here</a>, we need to extract the file <code>bios.bin</code>
from the <a href="http://download.opensuse.org/repositories/home:/jejb1:/UEFI/openSUSE_Tumbleweed/x86_64/OVMF-0.1+20160502+gd0a23f9-2.13.x86_64.rpm">OVMF rpm package</a>,
and then run QEMU with the <code>-bios</code> option:</p>
<pre><code class="language-bash">qemu-system-x86_64 -bios bios.bin -boot d -cdrom Solus-4.1-Budgie.iso -m 4096
</code></pre>
<div align="center">
<img src="https://upload.yixuan.blog/en/2020/12/bios-efi.png" alt="EFI" />
</div>
<p>This time the live CD was indeed booted in EFI mode.</p>
<div align="center">
<img src="https://upload.yixuan.blog/en/2020/12/qemu-efi.png" alt="QEMU with EFI" />
</div>
<p>The second problem is more crucial: we want to install the OS to the real hard disk,
but by default the virtual machine works on a virtual file system. To enable QEMU to
modify the physical hard disk, we need to pass in the <code>-hda</code> (or <code>-hdb</code>, etc.) option.
Suppose we want to map the physical disk <code>/dev/sdb</code> to the first virtual hard disk; we can run</p>
<pre><code class="language-bash">sudo qemu-system-x86_64 -bios bios.bin -boot d -cdrom Solus-4.1-Budgie.iso -hda /dev/sdb -m 4096
</code></pre>
<p>Note that this command must be run with <code>sudo</code>, since modifying <code>/dev/sdb</code> requires super-user privileges.</p>
<p>Finally, I was able to run the Solus live CD in EFI mode and install the system to the real machine.
The remaining tasks were routine: upgrading the whole system with <code>sudo eopkg upgrade</code> for better
hardware support, and installing the NVIDIA video card driver with <code>sudo eopkg install nvidia-glx-driver-current</code>.
After that I was able to run Solus on the new machine smoothly.</p>
MCMC notes (1)
https://statr.me/2019/12/mcmc-notes-1/
Sat, 28 Dec 2019 00:00:00 +0000
<p>Recently I was reading articles and books about MCMC, and realized that many materials were not taught in my
graduate study. To this end, I decided to write a summary of such content, to help readers and myself gain a
deeper understanding of MCMC in the future. I hope to make this topic a series, although I cannot guarantee
its completion. This article is the first one in this hypothetical series, and it introduces
an important concept: <strong>geometric ergodicity</strong>.</p>
<p>MCMC is an extremely broad topic, so we start with a classical algorithm, the <strong>Gibbs sampler</strong>. Our target is
to sample from a joint distribution $p(x,y)$, which however may have a complicated form. As a result, it is
typically hard to directly obtain samples from $p(x,y)$, but in many cases the two conditional distributions,
$p(x|y)$ and $p(y|x)$, have simpler forms and exact samplers. This is where the Gibbs sampler can help.
Starting with an arbitrary initial value $X_0$, the Gibbs sampler proceeds with the following iterations:</p>
<ol>
<li>Sample $Y_i\sim p(y|x=X_i)$</li>
<li>Sample $X_{i+1}\sim p(x|y=Y_i)$</li>
</ol>
<p>Then under some conditions, the distribution of $(X_i,Y_i)$ will converge to $p(x,y)$ as $i$ increases.</p>
<p>To visually demonstrate this process, we use the example in
<a href="http://www.ccs.neu.edu/home/vip/teach/DMcourse/5_topicmodel_summ/notes_slides/sampling/notes-gibbs-metro.pdf">this document</a>.
First define the following joint distribution:</p>
<p>$$p(x,y)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\binom{n}{x}y^{x+a-1}(1-y)^{n-x+b-1}.$$</p>
<p>Yes, I know it looks horrible, but with some calculations we can show that</p>
<p>$$X|\{Y=y\}\sim Binomial(n,y),\quad Y|\{X=x\}\sim Beta(a+x,b+n-x).$$</p>
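<p>To verify these forms, fix one variable and keep only the factors involving the other:</p>
<p>$$p(x|y)\propto\binom{n}{x}y^{x}(1-y)^{n-x},\qquad p(y|x)\propto y^{x+a-1}(1-y)^{n-x+b-1},$$</p>
<p>which are exactly the kernels of the $Binomial(n,y)$ and $Beta(a+x,b+n-x)$ densities.</p>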
<p>In other words, both conditional distributions are familiar. Using the Gibbs sampler, we construct the
following iterations:</p>
<ol>
<li>Sample $Y_i\sim Beta(a+X_i,b+n-X_i)$</li>
<li>Sample $X_{i+1}\sim Binomial(n,Y_i)$</li>
</ol>
<p>Fix $a=2,b=5,n=30$, and then we obtain the $X$ samples under different $i$.</p>
<pre><code class="language-r">n = 30
a = 2
b = 5
sample_y = function(x, a, b, n) rbeta(length(x), a + x, b + n - x)
sample_x = function(y, a, b, n) rbinom(length(y), size = n, prob = y)
gibbs = function(x0, niter, a, b, n)
{
    x = x0
    for(i in 1:niter)
    {
        y = sample_y(x, a, b, n)
        x = sample_x(y, a, b, n)
    }
    list(x = x, y = y)
}
set.seed(123)
x0 = rbinom(10000, size = n, prob = 0.5)
res1 = gibbs(x0, niter = 1, a = a, b = b, n = n)
res10 = gibbs(x0, niter = 10, a = a, b = b, n = n)
res100 = gibbs(x0, niter = 100, a = a, b = b, n = n)
</code></pre>
<p>In order to study the performance of the Gibbs sampler, we use the obtained samples to approximate
the density function of $X$ in a specific iteration, and then compare it with the true density.
Below shows the result under $i=1$, $i=10$, and $i=100$:</p>
<pre><code class="language-r">library(ggplot2)
pmf_x = function(x, a, b, n)
{
    freq_x = table(x)
    est_pmf = numeric(n + 1)
    names(est_pmf) = as.character(0:n)
    est_pmf[names(freq_x)] = freq_x / sum(freq_x)
    ind = 0:n
    true_pmf = choose(n, ind) * beta(a + ind, b + n - ind) / beta(a, b)
    list(ind = ind, true = true_pmf, est = est_pmf)
}
vis_x = function(res, a, b, n)
{
    pmf = pmf_x(res$x, a, b, n)
    gdat = data.frame(ind = rep(pmf$ind, 2), den = c(pmf$est, pmf$true),
                      type = rep(c("Gibbs", "True"), each = n + 1))
    ggplot(gdat, aes(x = ind, y = den, fill = type)) +
        geom_bar(stat = "identity", position = "dodge", alpha = 0.5) +
        scale_fill_brewer("Type", type = "qual", palette = "Set1") +
        xlab("x") + ylab("Density") +
        theme_bw()
}
vis_x(res1, a, b, n)
vis_x(res10, a, b, n)
vis_x(res100, a, b, n)
</code></pre>
<div align="center">
<img src="https://yixuan.cos.name/cn/images/mcmc-step-1.png" alt="MCMC Step 1" />
<img src="https://yixuan.cos.name/cn/images/mcmc-step-10.png" alt="MCMC Step 10" />
<img src="https://yixuan.cos.name/cn/images/mcmc-step-100.png" alt="MCMC Step 100" />
</div>
<p>Clearly, even if the Gibbs samples at the beginning are far away from the true distribution, their
difference is significantly reduced after 10 iterations. At last, with 100 iterations, the
difference is almost invisible.</p>
<p>So here comes one core question in MCMC that is remarkably important yet hard to answer:
<strong>how many iterations are sufficient?</strong></p>
<p>To answer this question, we need to first define a metric to evaluate the difference between
two distributions. In the example above, the domain of $X$ is $0,1,\ldots,n$. Let $p(x)$
denote the true density function, and $q^{(k)}(x)$ the density of Gibbs samples after $k$
iterations. Then we compute</p>
<p>$$d_{TV}(p,q^{(k)})=\frac{1}{2}\sum_{x=0}^n\vert p(x)-q^{(k)}(x)\vert,$$</p>
<p>which is known as the total variation (TV) distance. We slightly modify our previous code,
and record the value of $d_{TV}(p,q^{(k)})$ after each iteration.</p>
<pre><code class="language-r">dtv = function(x, a, b, n)
{
    pmf = pmf_x(x, a, b, n)
    0.5 * sum(abs(pmf$est - pmf$true))
}
gibbs_dtv = function(x0, niter, a, b, n)
{
    tv = c()
    x = x0
    for(i in 1:niter)
    {
        y = sample_y(x, a, b, n)
        x = sample_x(y, a, b, n)
        tv = c(tv, dtv(x, a, b, n))
    }
    tv
}
set.seed(123)
x0 = rbinom(10000, size = n, prob = 0.5)
tv = gibbs_dtv(x0, niter = 100, a = a, b = b, n = n)
qplot(1:100, tv, geom = c("point", "line")) +
xlab("# Gibbs Steps") + ylab("TV Distance") + theme_bw()
qplot(1:100, log10(tv), geom = c("point", "line")) +
xlab("# Gibbs Steps") + ylab("log(TV Distance)") + theme_bw()
</code></pre>
<div align="center">
<img src="https://yixuan.cos.name/cn/images/mcmc-tv.png" alt="TV distance" />
</div>
<p>The plot on the left shows the evolution of $d_{TV}(p,q^{(k)})$ with $k$, and the plot
on the right illustrates the logarithm of $d_{TV}(p,q^{(k)})$. Interestingly, the dots
in the second plot roughly form a straight line in the early stage, which indicates that
the TV distance between Gibbs samples and the true density almost decays <strong>exponentially</strong>.
The distance stays around a constant close to zero when $k$ is greater than 25. This is
because $q^{(k)}(x)$ is estimated from a random sample, and hence there exists a random
error that is associated with the sample size and does not disappear as $k$ increases.</p>
<p>I would like to point out that this exponential decay of the TV distance plays
a crucial role in the analysis of MCMC methods. Formally speaking, if the TV distance
between the true distribution and the distribution of MCMC samples after finitely many steps,
$d_{TV}(p,q^{(k)})$, decays exponentially with $k$, i.e., there exist constants
$C>0$ and $\rho>0$ such that</p>
<p>$$d_{TV}(p,q^{(k)})\le Ce^{-\rho k},$$</p>
<p>then we say that this MCMC algorithm is <strong>geometric ergodic</strong>. Geometric ergodicity
is important because it implies <strong>fast</strong> convergence of MCMC. In other words,
an MCMC algorithm that is geometric ergodic requires only a few iterations to
approximate the true distribution, as our previous example shows. In fact, most
of the commonly used MCMC algorithms are geometric ergodic, but to prove it
rigorously is quite technical. We may cover this point in the future.</p>
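<p>To illustrate the definition, the decay rate $\rho$ can be estimated by fitting a least-squares line to $\log d_{TV}(p,q^{(k)})$ against $k$ over the early iterations, before the distances hit the random-error floor. The sketch below uses synthetic distances so that the answer is known; with real output one would plug in the <code>tv</code> vector computed by <code>gibbs_dtv()</code>:</p>
<pre><code class="language-r"># Synthetic TV distances following d_TV = C * exp(-rho * k) with rho = 0.3
k = 1:20
tv_syn = 0.8 * exp(-0.3 * k)
# The slope of log(tv) against k estimates -rho
fit = lm(log(tv_syn) ~ k)
rho_hat = -unname(coef(fit)[2])
rho_hat  # recovers 0.3
</code></pre>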
<p>Reference: Sean Meyn and Richard Tweedie (1993). Markov Chains and Stochastic Stability, Springer.</p>
Updates on RSpectra: new "center" and "scale" parameters for svds()
https://statr.me/2019/11/rspectra-center-scale/
Fri, 29 Nov 2019 00:00:00 +0000
<p>Per the suggestion by <a href="https://github.com/robmaz">@robmaz</a>, <code>RSpectra::svds()</code>
now has two new parameters <code>center</code> and <code>scale</code>, to support implicit centering
and scaling of matrices in partial SVD. The minimum version for this new feature
is <code>RSpectra >= 0.16-0</code>.</p>
<p>These two parameters are very useful for principal component analysis (PCA)
based on the covariance or correlation matrix, without actually forming them.
Below we simulate a random data matrix, and use both R’s built-in
<code>prcomp()</code> and the <code>svds()</code> function in <code>RSpectra</code> to compute PCA.</p>
<pre><code class="language-r">library(RSpectra)
library(Matrix)
# Simulate data matrix
set.seed(123)
n = 2000
p = 5000
k = 10
x = matrix(rnorm(n * p), n)
# R's built-in function
system.time(res1 <- prcomp(x, center = TRUE, scale. = FALSE, rank. = k))
</code></pre>
<pre><code>## user system elapsed
## 6.918 0.135 7.053
</code></pre>
<pre><code class="language-r"># svds()
system.time(res2 <- svds(x, k, nu = 0, opts = list(center = TRUE, scale = FALSE)))
</code></pre>
<pre><code>## user system elapsed
## 1.432 0.000 1.432
</code></pre>
<pre><code class="language-r"># Check explained variances
head(res1$sdev, k)
</code></pre>
<pre><code>## [1] 2.581690 2.569330 2.562992 2.560865 2.558364 2.552452 2.547646 2.545435
## [9] 2.542844 2.538125
</code></pre>
<pre><code class="language-r"># Here we need to normalize the result by `sqrt(n-1)` if `scale = FALSE`
res2$d / sqrt(n - 1)
</code></pre>
<pre><code>## [1] 2.581690 2.569330 2.562992 2.560865 2.558364 2.552452 2.547646 2.545435
## [9] 2.542844 2.538125
</code></pre>
<pre><code class="language-r"># Check factor loadings (eigenvectors)
head(res1$rotation)
</code></pre>
<pre><code>## PC1 PC2 PC3 PC4 PC5
## [1,] 0.002955287 0.009469244 0.001791091 -0.004472325 0.0008397064
## [2,] 0.039091388 0.002317985 -0.016254929 -0.030014596 -0.0037889699
## [3,] 0.011089497 -0.014628014 0.024707183 -0.022815585 0.0142715032
## [4,] 0.019378279 0.003213703 0.011944642 0.005600938 0.0256555170
## [5,] 0.011824996 -0.025775262 -0.008988052 -0.008098504 0.0393693188
## [6,] -0.001385724 -0.014513030 0.035388546 -0.011516888 0.0003066008
## PC6 PC7 PC8 PC9 PC10
## [1,] 0.015071818 0.00461462 0.018306793 -0.005177352 -0.017118890
## [2,] 0.010304461 -0.01921043 -0.015591374 -0.012862728 0.019834698
## [3,] -0.001323470 -0.01548052 0.009874774 -0.006201647 0.035196107
## [4,] 0.019699390 0.01027901 0.008836943 0.020698740 -0.003105694
## [5,] 0.011788053 -0.02337262 0.058207087 -0.010045558 0.006709223
## [6,] -0.003844848 0.01498251 0.012312399 -0.011870723 0.032744347
</code></pre>
<pre><code class="language-r">head(res2$v)
</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.002955287 -0.009469244 -0.001791091 -0.004472325 -0.0008397064
## [2,] 0.039091388 -0.002317985 0.016254929 -0.030014596 0.0037889699
## [3,] 0.011089497 0.014628014 -0.024707183 -0.022815585 -0.0142715032
## [4,] 0.019378279 -0.003213703 -0.011944642 0.005600938 -0.0256555170
## [5,] 0.011824996 0.025775262 0.008988052 -0.008098504 -0.0393693188
## [6,] -0.001385724 0.014513030 -0.035388546 -0.011516888 -0.0003066008
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.015071818 -0.00461462 -0.018306793 -0.005177352 -0.017118890
## [2,] 0.010304461 0.01921043 0.015591374 -0.012862728 0.019834698
## [3,] -0.001323470 0.01548052 -0.009874774 -0.006201647 0.035196107
## [4,] 0.019699390 -0.01027901 -0.008836943 0.020698740 -0.003105694
## [5,] 0.011788053 0.02337262 -0.058207087 -0.010045558 0.006709223
## [6,] -0.003844848 -0.01498251 -0.012312399 -0.011870723 0.032744347
</code></pre>
<p>We can see that the two methods generate the same results (note that eigenvectors
are identical up to signs), but <code>svds()</code> is much faster than <code>prcomp()</code> since it
only computes the leading singular values.</p>
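<p>When checking this kind of agreement programmatically, the sign ambiguity of singular vectors can be removed by aligning the column signs before comparing. Below is a generic sketch with toy matrices; in the example above one would take <code>v1 = res1$rotation</code> and <code>v2 = res2$v</code>:</p>
<pre><code class="language-r"># Two orthonormal loading matrices that differ only by column signs
set.seed(1)
v1 = qr.Q(qr(matrix(rnorm(20), 5, 4)))
v2 = sweep(v1, 2, c(1, -1, -1, 1), "*")
# Detect per-column sign flips (columns have unit norm) and undo them
s = sign(colSums(v1 * v2))
all.equal(v1, sweep(v2, 2, s, "*"))  # TRUE
</code></pre>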
<p>The performance advantage of <code>svds()</code> is more evident if the input matrix is also
sparse. We repeat the experiment above, but on a different input:</p>
<pre><code class="language-r"># Simulate data matrix
set.seed(123)
n = 2000
p = 5000
k = 10
xsp = rnorm(n * p)
# 90% of the values are zero
xsp[sample(n * p, n * p * 0.9)] = 0
xsp = Matrix(xsp, n, p, sparse = TRUE)
# R's built-in function
system.time(res1 <- prcomp(xsp, center = TRUE, scale. = FALSE, rank. = k))
</code></pre>
<pre><code>## user system elapsed
## 6.656 0.159 6.815
</code></pre>
<pre><code class="language-r"># svds()
system.time(res2 <- svds(xsp, k, nu = 0, opts = list(center = TRUE, scale = FALSE)))
</code></pre>
<pre><code>## user system elapsed
## 0.47 0.00 0.47
</code></pre>
<pre><code class="language-r"># Check explained variances
head(res1$sdev, k)
</code></pre>
<pre><code>## [1] 0.8185087 0.8153621 0.8149581 0.8130531 0.8114726 0.8096282 0.8080887
## [8] 0.8076944 0.8058630 0.8052317
</code></pre>
<pre><code class="language-r"># Here we need to normalize the result by `sqrt(n-1)` if `scale = FALSE`
res2$d / sqrt(n - 1)
</code></pre>
<pre><code>## [1] 0.8185087 0.8153621 0.8149581 0.8130531 0.8114726 0.8096282 0.8080887
## [8] 0.8076944 0.8058630 0.8052317
</code></pre>
<pre><code class="language-r"># Check factor loadings (eigenvectors)
head(res1$rotation)
</code></pre>
<pre><code>## PC1 PC2 PC3 PC4 PC5
## [1,] -0.014078976 0.009977389 -0.0048860045 0.036143696 0.020856379
## [2,] 0.005891397 -0.006337634 0.0086567191 -0.024204012 0.006019543
## [3,] -0.004839200 0.006800550 0.0057230097 0.001902374 0.009481416
## [4,] 0.012187115 0.007566681 -0.0007558701 0.037469119 -0.016597866
## [5,] 0.017071907 -0.010494435 -0.0074782229 0.015558749 0.004890078
## [6,] -0.013842080 0.007694404 0.0018808309 -0.013330489 -0.017356824
## PC6 PC7 PC8 PC9 PC10
## [1,] -6.246730e-03 0.001227333 0.005002885 -0.019607028 0.005792201
## [2,] 1.020680e-02 0.007968355 -0.028889050 0.008189175 -0.006268807
## [3,] -4.108971e-02 0.011208912 0.005501149 -0.007212123 -0.021556508
## [4,] 2.665231e-02 -0.004031854 -0.010210344 -0.011379201 0.004734222
## [5,] 7.927843e-04 -0.003725163 -0.005354887 0.004972582 0.005005228
## [6,] -5.537012e-06 0.010373699 -0.019280452 0.019368409 -0.014930834
</code></pre>
<pre><code class="language-r">head(res2$v)
</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.014078976 -0.009977389 0.0048860045 -0.036143696 -0.020856379
## [2,] -0.005891397 0.006337634 -0.0086567191 0.024204012 -0.006019543
## [3,] 0.004839200 -0.006800550 -0.0057230097 -0.001902374 -0.009481416
## [4,] -0.012187115 -0.007566681 0.0007558701 -0.037469119 0.016597866
## [5,] -0.017071907 0.010494435 0.0074782229 -0.015558749 -0.004890078
## [6,] 0.013842080 -0.007694404 -0.0018808309 0.013330489 0.017356824
## [,6] [,7] [,8] [,9] [,10]
## [1,] 6.246730e-03 0.001227333 0.005002885 0.019607028 0.005792201
## [2,] -1.020680e-02 0.007968355 -0.028889050 -0.008189175 -0.006268807
## [3,] 4.108971e-02 0.011208912 0.005501149 0.007212123 -0.021556508
## [4,] -2.665231e-02 -0.004031854 -0.010210344 0.011379201 0.004734222
## [5,] -7.927843e-04 -0.003725163 -0.005354887 -0.004972582 0.005005228
## [6,] 5.537012e-06 0.010373699 -0.019280452 -0.019368409 -0.014930834
</code></pre>
<p>The above code demonstrates a roughly 15x speedup.</p>
<p>To perform PCA based on the correlation matrix, simply specify <code>scale = TRUE</code>. Below
is an example of two different approaches that lead to the same output.</p>
<pre><code class="language-r"># Simulate data matrix
set.seed(123)
n = 2000
p = 5000
k = 10
xsp = rnorm(n * p)
# 90% of the values are zero
xsp[sample(n * p, n * p * 0.9)] = 0
xsp = Matrix(xsp, n, p, sparse = TRUE)
# R's built-in function
system.time(res1 <- prcomp(xsp, center = TRUE, scale. = TRUE, rank. = k))
</code></pre>
<pre><code>## user system elapsed
## 7.084 0.205 7.291
</code></pre>
<pre><code class="language-r"># svds()
system.time(res2 <- svds(xsp, k, nu = 0, opts = list(center = TRUE, scale = TRUE)))
</code></pre>
<pre><code>## user system elapsed
## 0.587 0.000 0.587
</code></pre>
<pre><code class="language-r"># Check explained variances
head(res1$sdev, k)
</code></pre>
<pre><code>## [1] 2.576689 2.566099 2.562755 2.560604 2.551354 2.547781 2.546969 2.540066
## [9] 2.536446 2.534298
</code></pre>
<pre><code class="language-r">res2$d
</code></pre>
<pre><code>## [1] 2.576689 2.566099 2.562755 2.560604 2.551354 2.547781 2.546969 2.540066
## [9] 2.536446 2.534298
</code></pre>
<pre><code class="language-r"># Check factor loadings (eigenvectors)
head(res1$rotation)
</code></pre>
<pre><code>## PC1 PC2 PC3 PC4 PC5
## [1,] -0.006483247 -0.0009410683 -0.019834553 0.032506226 -0.002916057
## [2,] 0.012595033 0.0011791356 0.010536472 -0.020026913 0.025265776
## [3,] -0.001758559 0.0074203003 -0.009177566 -0.001530424 0.007236185
## [4,] 0.010485835 0.0102749040 -0.017666064 0.018482750 -0.027623466
## [5,] 0.019636988 -0.0051696089 -0.004308241 0.019228107 -0.001763454
## [6,] -0.013690684 -0.0025113870 -0.003252875 -0.030508385 -0.003294067
## PC6 PC7 PC8 PC9 PC10
## [1,] -0.001658184 -7.242177e-03 -0.0008240643 -0.007626584 -0.0006768329
## [2,] -0.014348978 7.286501e-03 0.0121723469 0.007891901 0.0271362784
## [3,] -0.012574501 -1.400828e-02 -0.0353942993 -0.009328341 0.0040393007
## [4,] 0.007793619 4.733246e-03 0.0230159209 -0.006839061 -0.0027831289
## [5,] -0.002541560 -2.382465e-05 0.0130737963 0.003924594 -0.0073279903
## [6,] -0.008978843 -4.013950e-03 0.0138156553 0.013757868 0.0160054867
</code></pre>
<pre><code class="language-r">head(res2$v)
</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.006483247 -0.0009410683 0.019834553 0.032506226 0.002916057
## [2,] -0.012595033 0.0011791356 -0.010536472 -0.020026913 -0.025265776
## [3,] 0.001758559 0.0074203003 0.009177566 -0.001530424 -0.007236185
## [4,] -0.010485835 0.0102749040 0.017666064 0.018482750 0.027623466
## [5,] -0.019636988 -0.0051696089 0.004308241 0.019228107 0.001763454
## [6,] 0.013690684 -0.0025113870 0.003252875 -0.030508385 0.003294067
## [,6] [,7] [,8] [,9] [,10]
## [1,] -0.001658184 -7.242177e-03 -0.0008240643 0.007626584 0.0006768329
## [2,] -0.014348978 7.286501e-03 0.0121723469 -0.007891901 -0.0271362784
## [3,] -0.012574501 -1.400828e-02 -0.0353942993 0.009328341 -0.0040393007
## [4,] 0.007793619 4.733246e-03 0.0230159209 0.006839061 0.0027831289
## [5,] -0.002541560 -2.382465e-05 0.0130737963 -0.003924594 0.0073279902
## [6,] -0.008978843 -4.013950e-03 0.0138156553 -0.013757868 -0.0160054867
</code></pre>
Extracting specific lines from a large (compressed) text file
https://statr.me/2018/05/extracting-lines-from-a-large-file/
Sun, 27 May 2018 00:00:00 +0000
<p>A few days ago a friend asked me the following question: how to efficiently
extract some specific lines from a large text file, possibly compressed by Gzip?
He mentioned that he had tried some R functions such as <code>read.table(skip = ...)</code>,
but found that reading the data was too slow. Hence he was looking for some
alternative ways to extract the data.</p>
<p>This is a common task in preprocessing large data sets, since in data exploration,
very often we want to peek at a small subset of the whole data to gain some insights.
If the data are stored in a text file, then we want to extract some specific lines
with the given line numbers. After I got one solution, I felt this might be useful for
future reference, so below are some of my notes for this problem.</p>
<p>After a quick search on <a href="https://stackoverflow.com/a/83347">StackOverflow</a>,
it turns out that the solution is to let the
right tool do the right thing. Assuming a UNIX-like environment, the <code>sed</code> command
is the way to go: to extract lines 5 through 8 from the file <code>somefile.txt</code>, simply run
<pre><code class="language-bash">sed -n '5,8p' somefile.txt
</code></pre>
<p>This is very straightforward if you want to read consecutive lines, but things are
more complicated here:</p>
<ol>
<li>We need to extract potentially discontiguous lines, for example, the line numbers
are saved in an R vector.</li>
<li>The text file is compressed by Gzip, and we do not want to extract the whole file.</li>
</ol>
<p>The solution to the first point still uses <code>sed</code>: for example, to extract
lines 2, 4, and 6, the following command works.</p>
<pre><code class="language-bash">sed -n '2p;4p;6p' somefile.txt
</code></pre>
<p>Even better, we can actually generate this command and run it within
R, so that we do not need to manually type the command when the list is long. We will
demonstrate this at the end.</p>
<p>The second issue can be solved by utilizing the powerful pipe mechanism in
UNIX-like systems. In short, we uncompress the file using <code>zcat</code>, and send the
streamed data to <code>sed</code> for extraction. For example, if the text file is compressed as
<code>somefile.txt.gz</code>, then the following command reads lines 2, 4, and 6 of the original file:</p>
<pre><code class="language-bash">zcat somefile.txt.gz | sed -n '2p;4p;6p;7q'
</code></pre>
<p>Note that we append a <code>7q</code> command at the end, which asks <code>sed</code> to exit after reading line 7.
This is <strong>very important</strong> for reading large data sets, since otherwise <code>sed</code> would scan
the whole file to the end.</p>
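<p>A quick self-contained demonstration of both points, using a toy file (the file names are just for illustration):</p>
<pre><code class="language-bash"># Create a 7-line file and extract lines 2, 4, and 6
printf 'a\nb\nc\nd\ne\nf\ng\n' > demo.txt
sed -n '2p;4p;6p;7q' demo.txt           # prints b, d, f
# Same extraction on a gzip-compressed copy
gzip -f demo.txt                         # produces demo.txt.gz
zcat demo.txt.gz | sed -n '2p;4p;6p;7q'  # prints b, d, f
</code></pre>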
<p>Combining things together, we can write a simple R function to accomplish this task
under different scenarios: input file can either be in plain text or be compressed,
results can be saved to a new file or be returned as a character vector, etc.</p>
<pre><code class="language-r">#' Extract lines from a large text file
#'
#' @param infile path to the input file
#' @param lines a vector of line numbers
#' @param outfile if `NULL`, return the result as a vector of character strings;
#' otherwise the path to the output file
#' @param gzip whether the input file is compressed using Gzip
extract_lines = function(infile, lines, outfile = NULL, gzip = FALSE)
{
    # e.g., lines = c(1, 20, 100)
    lines = as.integer(lines)
    # 1p;20p;100p;101q
    sed_arg = paste(lines, "p", sep = "", collapse = ";")
    sed_arg = paste(sed_arg, ";", max(lines) + 1, "q", sep = "")
    # sed -n '1p;20p;100p;101q'
    sed_command = sprintf("sed -n '%s'", sed_arg)
    # If outfile is not `NULL`, redirect the result to a file
    out_command = if(is.null(outfile)) "" else sprintf("> %s", as.character(outfile))
    # If the file is compressed, combine `zcat` with `sed`
    if(gzip)
        command = sprintf("zcat %s | %s %s", infile, sed_command, out_command)
    else
        command = sprintf("%s %s %s", sed_command, infile, out_command)
    # Execute the command
    system(command, intern = is.null(outfile))
}
</code></pre>
Blog not down
https://statr.me/2017/08/blog-not-down/
Sun, 13 Aug 2017 00:00:00 +0000
<p>It has been one year since my last article, and here is a quick post indicating
that my blog is not down. Instead, it has a new look thanks to
<a href="https://github.com/rstudio/blogdown">blogdown</a>. Yes, pun intended. :-)</p>
<p><strong>blogdown</strong>, mostly written by <a href="https://yihui.name/">Yihui</a>, is an R
package that can help you rapidly create a static blog or website.
The package name has nothing to do with the status of a website
(as in “the server is down”), but rather follows the convention of other
<a href="https://daringfireball.net/projects/markdown/syntax">Markdown</a>-based packages
such as <a href="https://github.com/rstudio/rmarkdown">rmarkdown</a> and
<a href="https://github.com/rstudio/bookdown">bookdown</a>. (As for the name of <em>Markdown</em>,
I suspect that it was chosen to look different from other
popular <em>markup</em> languages at that time such as HTML.)</p>
<p>The blogdown package is based on the <a href="https://gohugo.io/">Hugo</a> blogging system.
Compared with the <a href="https://jekyllrb.com/">Jekyll</a> system that my old blog was based on,
Hugo has some nice features that finally drove me to complete this switch:
<ol>
<li>The installation is very easy. In fact, it provides a single executable file for most
mainstream operating systems.</li>
<li>The blogdown package simplifies this process even more.</li>
<li>Hugo has a decent <a href="https://themes.gohugo.io/">theming system</a>, and more importantly,
it provides a mechanism with which you can import themes developed by others and make
your own modifications without messing up the code.</li>
</ol>
<p>As I mentioned above, blogdown simplifies the procedure of installing Hugo, downloading
a theme, and creating a new site. The following three lines are sufficient for initiating
a new blog:</p>
<pre><code class="language-r">library(blogdown)
install_hugo() ## run once
new_site(theme = "kakawait/hugo-tranquilpeak-theme")
</code></pre>
<p>The <code>theme</code> parameter is optional since it has a default value <code>theme = "yihui/hugo-lithium-theme"</code>.
The value I specified will use the theme at
<a href="https://github.com/kakawait/hugo-tranquilpeak-theme">https://github.com/kakawait/hugo-tranquilpeak-theme</a>
(thanks to the theme author!). Then you get a directory of Hugo source files,
and the generated static HTML pages in the <code>public</code> folder.</p>
<p>Note that unlike Jekyll blogs that can be directly rendered by <a href="https://pages.github.com/">Github Pages</a>,
Hugo blogs are not yet supported by Github. But fortunately, you can deploy your blog on
<a href="https://www.netlify.com/">Netlify</a> by linking a Github repository.
There is a nice <a href="https://www.netlify.com/blog/2016/09/21/a-step-by-step-guide-victor-hugo-on-netlify/">article</a>
talking about this.</p>
<p>For me, the next major steps to fully migrate the blog to Hugo were as follows:</p>
<ol>
<li>Copying <code>*.md</code> files to the <code>content</code> folder of the Hugo site.</li>
<li>Making my customizations of the theme under the <code>layouts</code> folder. The layout
files under this folder have higher priority than the imported theme files, so you can
easily override some theme functions without polluting the upstream.</li>
<li>Placing assets such as CSS files and images in the <code>static</code> folder, whose contents
are served from the website root directory.</li>
<li>Pushing the directory to Github and linking it to Netlify.</li>
<li>Pointing the domain name to the new server address as described in
<a href="https://www.netlify.com/docs/custom-domains/">this document</a>.</li>
</ol>
<p>So far all of my posts are plain Markdown files, meaning that they are not dynamic
documents containing executable code. But in fact another attractive feature of
blogdown is that it allows you to write R-Markdown-style posts that are automatically
compiled into Markdown with the rendered output. At the time of writing, the blogdown
package is under active development, and I hope its first formal release will arrive on CRAN soon.</p>
Creating pretty documents with the prettydoc package
https://statr.me/2016/08/creating-pretty-documents-with-the-prettydoc-package/
Thu, 11 Aug 2016 00:00:00 +0000
<blockquote>
<p>Have you ever tried to find a lightweight yet nice theme for the R Markdown
documents, like <a href="https://yixuan.cos.name/prettydoc/cayman.html">this page</a>?</p>
</blockquote>
<h1 id="themes-for-r-markdown">Themes for R Markdown</h1>
<p>With the powerful <a href="https://rmarkdown.rstudio.com/index.html">rmarkdown</a>
package, we can easily create a nice HTML document
by adding some meta information in the header, for example</p>
<pre><code class="language-yaml">---
title: Nineteen Years Later
author: Harry Potter
date: July 31, 2016
output:
  rmarkdown::html_document:
    theme: lumen
---
</code></pre>
<p>The <a href="https://rmarkdown.rstudio.com/html_document_format.html">html_document</a>
engine uses the <a href="https://bootswatch.com/">Bootswatch</a>
theme library to support different styles of the document.
This is a quick and easy way to tune the appearance of your document, but it comes at
the price of a large file size (> 700KB), since the whole
<a href="https://getbootstrap.com/">Bootstrap</a> library needs to be packed in.</p>
<p>For package vignettes, we can use the
<a href="https://rmarkdown.rstudio.com/package_vignette_format.html">html_vignette</a>
engine to generate a more lightweight HTML file that is meant to minimize the
package size, but the output HTML is less stylish than the <code>html_document</code> ones.</p>
<p>So can we have <strong>BOTH</strong>: a lightweight yet nice-looking theme for R Markdown?</p>
<h1 id="the-prettydoc-engine">The prettydoc Engine</h1>
<p>The answer is YES! (At least towards that direction)</p>
<p>The <a href="https://github.com/yixuan/prettydoc/">prettydoc</a> package
(available on <a href="https://cran.r-project.org/package=prettydoc">CRAN</a>)
provides an alternative engine, <code>html_pretty</code>,
to knit your R Markdown document into pretty HTML pages.
Its usage is extremely easy: simply replace the
<code>rmarkdown::html_document</code> or <code>rmarkdown::html_vignette</code> output engine with
<code>prettydoc::html_pretty</code> in your R Markdown header, and choose one of the built-in
themes and syntax highlighters. For example</p>
<pre><code class="language-yaml">---
title: Nineteen Years Later
author: Harry Potter
date: July 31, 2016
output:
  prettydoc::html_pretty:
    theme: cayman
    highlight: github
---
</code></pre>
<p>You can also create documents from <strong>prettydoc</strong> templates in RStudio (after
installing the package).</p>
<p><strong>Step 1:</strong> Click the “New File” button and choose “R Markdown”.</p>
<div align="center">
<img src="https://yixuan.cos.name/prettydoc/images/step1.png" alt="Step 1" />
</div>
<p><strong>Step 2:</strong> In the “From Template” tab, choose one of the built-in templates.</p>
<div align="center">
<img src="https://yixuan.cos.name/prettydoc/images/step2.png" alt="Step 2" />
</div>
<h1 id="options-and-themes">Options and Themes</h1>
<p>The options for the <code>html_pretty</code> engine are fully compatible with the default
<code>html_document</code>
(see the <a href="https://rmarkdown.rstudio.com/html_document_format.html">documentation</a>)
with two exceptions:</p>
<ol>
<li>The <code>theme</code> option can take value from <code>cayman</code>, <code>tactile</code> and
<code>architect</code>. More themes will be added in the future. The themes contained in
<strong>prettydoc</strong> are much inspired by and modified from
various <a href="https://github.com/blog/1081-instantly-beautiful-project-pages">Github page themes</a>.</li>
<li>The <code>highlight</code> option takes value from <code>github</code> and <code>vignette</code>.</li>
</ol>
<h1 id="gallery">Gallery</h1>
<p>Here are some screenshots of the HTML pages generated by <strong>prettydoc</strong> with
different themes and syntax highlighters.</p>
<div align="center">
<h2>Cayman <a href="https://yixuan.cos.name/prettydoc/cayman.html">(demo page)</a></h2>
<a href="https://yixuan.cos.name/prettydoc/cayman.html">
<img width="600px" src="https://yixuan.cos.name/prettydoc/images/cayman.png" alt="Cayman Theme" />
</a>
</div>
<div align="center">
<h2>Tactile <a href="https://yixuan.cos.name/prettydoc/tactile.html">(demo page)</a></h2>
<a href="https://yixuan.cos.name/prettydoc/tactile.html">
<img width="600px" src="https://yixuan.cos.name/prettydoc/images/tactile.png" alt="Tactile Theme" />
</a>
</div>
<div align="center">
<h2>Architect <a href="https://yixuan.cos.name/prettydoc/architect.html">(demo page)</a></h2>
<a href="https://yixuan.cos.name/prettydoc/architect.html">
<img width="600px" src="https://yixuan.cos.name/prettydoc/images/architect.png" alt="Architect Theme" />
</a>
</div>
<p>If you think this package is helpful, feel free to leave comments or
request features in the <a href="https://github.com/yixuan/prettydoc/">Github repository</a>.
Contribution and pull requests are always welcome.</p>
recosystem: recommender system using parallel matrix factorization
https://statr.me/2016/07/recommender-system-using-parallel-matrix-factorization/
Fri, 15 Jul 2016 00:00:00 +0000https://statr.me/2016/07/recommender-system-using-parallel-matrix-factorization/
<h1 id="a-quick-view-of-recommender-system">A Quick View of Recommender System</h1>
<p>The main task of a recommender system is to predict unknown entries in the
rating matrix based on observed values, as is shown in the table below:</p>
<div align="center">
<img src="https://i.imgur.com/bmW79NS.png" alt="Rating matrix" />
</div>
<p>Each cell with a number in it is the rating given by some user on a specific
item, while those marked with question marks are unknown ratings that need
to be predicted. In the literature, this problem is also known as
collaborative filtering, matrix completion, matrix recovery, etc.</p>
<p>A popular technique to solve the recommender system problem is the matrix
factorization method. The idea is to approximate the whole rating matrix
$R_{m\times n}$ by the product of two matrices of lower dimensions,
$P_{k\times m}$ and $Q_{k\times n}$, such that</p>
<p>$$R\approx P^\prime Q$$</p>
<p>Let $p_u$ be the $u$-th column of $P$, and $q_v$ be the
$v$-th column of $Q$, then the rating given by user $u$ on item $v$
would be predicted as $p^\prime_u q_v$.</p>
<p>A typical solution for $P$ and $Q$ is given by the following optimization
problem [<a href="#FPSG2015">1</a>; <a href="#LRSG">2</a>]:</p>
<p>$$\min_{P,Q} \sum_{(u,v)\in R} \left[f(p_u,q_v;r_{u,v})+\mu_P||p_u||_1+\mu_Q||q_v||_1+\frac{\lambda_P}{2} ||p_u||_2^2+\frac{\lambda_Q}{2} ||q_v||_2^2\right]$$</p>
<p>where $(u,v)$ are locations of observed entries in $R$, $r_{u,v}$ is
the observed rating, $f$ is the loss function, and
$\mu_P,\mu_Q,\lambda_P,\lambda_Q$ are penalty parameters
to avoid overfitting.</p>
<p>The process of solving for the matrices $P$ and $Q$ is called
model training, and the selection of penalty parameters is called
parameter tuning. After obtaining $P$ and $Q$, we can then predict the
unknown ratings by $\hat{R}_{u,v}=p^\prime_u q_v$.</p>
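<p>The factorization and prediction steps can be sketched in a few lines of base R, using small random matrices in place of a trained model (this only illustrates the algebra, not how $P$ and $Q$ are actually solved for):</p>
<pre><code class="language-r">set.seed(123)
k = 2; m = 3; n = 4
P = matrix(rnorm(k * m), k, m)  # user factors, one column per user
Q = matrix(rnorm(k * n), k, n)  # item factors, one column per item
## The full predicted rating matrix is P'Q
Rhat = t(P) %*% Q
## The rating of user u on item v is the inner product of p_u and q_v
u = 2; v = 3
all.equal(Rhat[u, v], sum(P[, u] * Q[, v]))  # TRUE
</code></pre>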
<h1 id="libmf-and-recosystem">LIBMF and recosystem</h1>
<p><a href="https://www.csie.ntu.edu.tw/~cjlin/libmf/">LIBMF</a>
is an open source C++ library for recommender system using parallel
matrix factorization, developed by
<a href="https://www.csie.ntu.edu.tw/~cjlin/">Dr. Chih-Jen Lin</a> and his research
group [<a href="#LIBMF">3</a>].</p>
<p>LIBMF is a parallelized library, meaning that
users can take advantage of multi-core CPUs to speed up the computation.
It also utilizes some advanced CPU features to further improve the performance.</p>
<p><a href="https://cran.r-project.org/package=recosystem">recosystem</a>
(<a href="https://github.com/yixuan/recosystem">Github</a>) is an R wrapper of
the LIBMF library that inherits most of its features. Additionally, this
package provides a number of user-friendly R functions to
simplify data processing and model building. Also, unlike most other R packages
for statistical modeling that keep the whole dataset and model object in
memory, LIBMF (and hence <code>recosystem</code>) can significantly reduce memory use:
for instance, the trained model, which contains the information needed for prediction,
can be stored on the hard disk, and the prediction output can also be written
directly into a file rather than kept in memory.</p>
<h1 id="overview-of-recosystem">Overview of recosystem</h1>
<p>The usage of <code>recosystem</code> is quite simple, mainly consisting of the following steps:</p>
<ol>
<li>Create a model object (a Reference Class object in R) by calling <code>Reco()</code>.</li>
<li>Specify the data source, either from a data file or from R objects in memory.</li>
<li>Train the model by calling the <code>$train()</code> method. A number of parameters
can be set inside the function.</li>
<li>(Optionally) Call the <code>$tune()</code> method to select the best tuning parameters
from a set of candidate values, in order to achieve better model performance.</li>
<li>(Optionally) Export the model via <code>$output()</code>, i.e. write the factorization matrices
$P$ and $Q$ into files or return them as R objects.</li>
<li>Use the <code>$predict()</code> method to compute predicted values.</li>
</ol>
<p>More details are covered in the package
<a href="https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html">vignette</a>
and the help pages <code>?recosystem::Reco</code>, <code>?recosystem::data_source</code>, <code>?recosystem::train</code>,
<code>?recosystem::tune</code>, <code>?recosystem::output</code>, and <code>?recosystem::predict</code>.</p>
<p>In the next section we will demonstrate how to use <code>recosystem</code> to analyze a
real movie recommendation data set.</p>
<h1 id="movielens-data">MovieLens Data</h1>
<p>The <a href="https://movielens.org/">MovieLens</a> website has collected a large amount of
movie rating data for research use [<a href="#MovieLens">4</a>]. In this article we download the
<a href="https://files.grouplens.org/datasets/movielens/ml-1m.zip">MovieLens 1M Dataset</a>
from <a href="https://grouplens.org/datasets/movielens/">grouplens</a>,
which contains 1 million ratings from 6000 users and 4000 movies.</p>
<p>The rating data file, <code>ratings.dat</code>, looks like below:</p>
<pre><code>1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
...
</code></pre>
<p>Each line has the format <code>UserID::MovieID::Rating::Timestamp</code>;
for example, the first line says that User #1 gave Movie #1193 a rating of
5 at a certain time point.</p>
<p>In <code>recosystem</code>, we will not use the time information, and the required data
format is <code>UserID MovieID Rating</code>, i.e., the columns are space-separated, and
columns after <code>Rating</code> will be ignored.
Therefore, we first transform the data file into a format that is supported by
<code>recosystem</code>. On Unix-like OS’s, we can use the <code>sed</code> command to replace <code>::</code>
by a space:</p>
<pre><code class="language-bash">sed -e 's/::/ /g' ratings.dat > ratings2.dat
</code></pre>
<p>Then we can start to train a recommender, as the following code shows:</p>
<pre><code class="language-r">library(recosystem)  # 1
r = Reco()  # 2
train_set = data_file("ratings2.dat", index1 = TRUE)  # 3
r$train(train_set, opts = list(dim = 20,  # 4
                               costp_l1 = 0, costp_l2 = 0.01,  # 5
                               costq_l1 = 0, costq_l2 = 0.01,  # 6
                               niter = 10,  # 7
                               nthread = 4))  # 8
</code></pre>
<p>In the code above, line 2 creates a model object such that the training
function <code>$train()</code> can be called from it. Line 3 specifies the data
source – a data file on the hard disk. Since in our data user IDs and movie IDs
start from 1 rather than 0, we use the <code>index1 = TRUE</code> option in the function.</p>
<p>The data can also be read from memory, if the UserID, MovieID and Rating
columns are stored as R vectors. Below shows an alternative way to provide
the training set.</p>
<pre><code class="language-r">dat = read.table("ratings2.dat", sep = " ", header = FALSE,
                 colClasses = c(rep("integer", 3), "NULL"))
train_set = data_memory(user_index = dat[, 1],
                        item_index = dat[, 2],
                        rating = dat[, 3], index1 = TRUE)
</code></pre>
<p>Line 4 to line 6 set the relevant model parameters: $k, \mu_P,\mu_Q,\lambda_P$,
and $\lambda_Q$, and Line 7 gives the number of iterations. Finally as I have
mentioned previously, LIBMF is a parallelized library, so
users can specify the number of threads that will be working
simultaneously via the <code>nthread</code> parameter. However, when <code>nthread > 1</code>,
the training result is <strong>NOT</strong> guaranteed to be reproducible, even if
a random seed is set.</p>
<p>Now everything looks good, except one inadequacy: the setting of tuning
parameters is ad-hoc, which may make the model sub-optimal.
To tune these parameters, we can call the <code>$tune()</code> function to test
a set of candidate values and use cross validation
to evaluate their performance. Below shows this process:</p>
<pre><code class="language-r">opts_tune = r$tune(train_set,  # 9
                   opts = list(dim = c(10, 20, 30),  # 10
                               costp_l2 = c(0.01, 0.1),  # 11
                               costq_l2 = c(0.01, 0.1),  # 12
                               costp_l1 = 0,  # 13
                               costq_l1 = 0,  # 14
                               lrate = c(0.01, 0.1),  # 15
                               nthread = 4,  # 16
                               niter = 10,  # 17
                               verbose = TRUE))  # 18
r$train(train_set, opts = c(opts_tune$min,  # 19
                            niter = 100, nthread = 4))  # 20
</code></pre>
<p>The options in lines 10 to 15 are tuning parameters. The tuning function
will evaluate each combination of them and calculate the associated
cross-validated RMSE. The parameter set with the smallest RMSE will be contained
in the returned value, which can then be passed to <code>$train()</code> (Line 19-20).</p>
<p>Finally, we can use the model object to do predictions. The code
below shows how to predict ratings given by the first 20 users
on the first 20 movies.</p>
<pre><code class="language-r">user = 1:20
movie = 1:20
pred = expand.grid(user = user, movie = movie)
test_set = data_memory(pred$user, pred$movie, index1 = TRUE)
pred$rating = r$predict(test_set, out_memory())
library(ggplot2)
ggplot(pred, aes(x = movie, y = user, fill = rating)) +
    geom_raster() +
    scale_fill_gradient("Rating", low = "#d6e685", high = "#1e6823") +
    xlab("Movie ID") + ylab("User ID") +
    coord_fixed() +
    theme_bw(base_size = 22)
</code></pre>
<div align="center">
<img src="https://i.imgur.com/nFGyyaO.png" alt="Predicted ratings" />
</div>
<h1 id="performance">Performance</h1>
<p>To make the best use of <code>recosystem</code>, the parallel computing option <code>nthread</code>
should be used in the training and tuning steps. Also, LIBMF and <code>recosystem</code> can
make use of some advanced CPU features to speed up computation, if you
compile the package from source and turn on some compiler options.</p>
<p>To build <code>recosystem</code>, one needs a C++ compiler that supports
the C++11 standard. You can then edit <code>src/Makevars</code> (<code>src/Makevars.win</code> on Windows)
according to the following guideline:</p>
<ul>
<li>The default <code>Makevars</code> provides generic options that should apply to most
CPUs.</li>
<li><p>If your CPU supports SSE3
(<a href="https://en.wikipedia.org/wiki/SSE3">a list of supported CPUs</a>), add</p>
<pre><code>PKG_CPPFLAGS += -DUSESSE
PKG_CXXFLAGS += -msse3
</code></pre></li>
<li><p>If not only SSE3 is supported but also AVX
(<a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">a list of supported CPUs</a>), add</p>
<pre><code>PKG_CPPFLAGS += -DUSEAVX
PKG_CXXFLAGS += -mavx
</code></pre></li>
</ul>
<p>After editing the <code>Makevars</code> file, run <code>R CMD INSTALL recosystem</code> to install <code>recosystem</code>.</p>
<p>The plot below shows the effect of parallel computing and the compiler option on
the performance of computation. The y-axis is the elapsed time of the model tuning
procedure in the previous example.</p>
<div align="center">
<img src="https://i.imgur.com/GfgShWZ.png" alt="Performance" />
</div>
<h2 id="references">References</h2>
<div id="FPSG2015"></div>
<p>[1] Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015a. <a href="https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_journal.pdf"><em>A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems</em></a>. ACM TIST.</p>
<div id="LRSG"></div>
<p>[2] Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015b. <a href="https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/mf_adaptive_pakdd.pdf"><em>A Learning-Rate Schedule for Stochastic Gradient Methods to Matrix Factorization</em></a>. ACM TIST.</p>
<div id="LIBMF"></div>
<p>[3] Lin, Chih-Jen, Yu-Chin Juan, Yong Zhuang, and Wei-Sheng Chin. 2015. <a href="https://www.csie.ntu.edu.tw/~cjlin/libmf/"><em>LIBMF: A Matrix-Factorization Library for Recommender Systems</em></a>.</p>
<div id="MovieLens"></div>
<p>[4] F. Maxwell Harper and Joseph A. Konstan. 2015. <a href="https://dx.doi.org/10.1145/2827872"><em>The MovieLens Datasets: History
and Context</em></a>. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4,
Article 19 (December 2015), 19 pages.</p>
RcppNumerical: numerical integration and optimization with Rcpp
https://statr.me/2016/04/rcppnumerical-numerical-integration-optimization-rcpp/
Sat, 09 Apr 2016 00:00:00 +0000https://statr.me/2016/04/rcppnumerical-numerical-integration-optimization-rcpp/
<h1 id="introduction">Introduction</h1>
<p>I have seen several conversations in the Rcpp-devel mailing list asking how to
do numerical integration or optimization in Rcpp. While R in fact
has the functions <code>Rdqags</code>, <code>Rdqagi</code>, <code>nmmin</code>, <code>vmmin</code>, etc. in its API
to accomplish such tasks, it is not so straightforward to use them with Rcpp.</p>
<p>For my own research projects I need to do a lot of numerical integration,
root finding and optimization, so to make my life a little bit easier, I
just created the <a href="https://github.com/yixuan/RcppNumerical">RcppNumerical</a>
package that simplifies these procedures. I haven’t submitted <code>RcppNumerical</code>
to CRAN, since the API may change quickly according to my needs or the
feedback from other people.</p>
<p>Basically <code>RcppNumerical</code> includes a number of open source libraries for
numerical computing, so that Rcpp code can link to this package to use
the functions provided by these libraries. Alternatively, <code>RcppNumerical</code>
provides some wrapper functions that have less configuration and fewer
arguments, if you just want to use the default and quickly get the results.</p>
<p><code>RcppNumerical</code> depends on <code>Rcpp</code> (obviously) and <code>RcppEigen</code>. To use it:</p>
<ul>
<li>To use <code>RcppNumerical</code> with <code>Rcpp::sourceCpp()</code>, add the following two lines
to the C++ source file:</li>
</ul>
<pre><code class="language-cpp">// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::depends(RcppNumerical)]]
</code></pre>
<ul>
<li><p>To use <code>RcppNumerical</code> in your package, add the corresponding fields to the
<code>DESCRIPTION</code> file:</p>
<pre><code>Imports: RcppNumerical
LinkingTo: Rcpp, RcppEigen, RcppNumerical
</code></pre></li>
</ul>
<p>Also in the <code>NAMESPACE</code> file, add:</p>
<pre><code>import(RcppNumerical)
</code></pre>
<h1 id="numerical-integration">Numerical Integration</h1>
<div align="center">
<img src="https://i.imgur.com/NYIAs1J.png" width="350px" />
<p>(Picture from <a href="https://en.wikipedia.org/wiki/Integral">Wikipedia</a>)</p>
</div>
<p>The numerical integration code contained in <code>RcppNumerical</code> is based
on the <a href="https://github.com/tbs1980/NumericalIntegration">NumericalIntegration</a>
library developed by <a href="https://github.com/tbs1980">Sreekumar Thaithara Balan</a>,
<a href="https://github.com/mcsauder">Mark Sauder</a>, and Matt Beall.</p>
<p>To compute integration of a function, first define a functor inherited from
the <code>Func</code> class:</p>
<pre><code class="language-cpp">class Func
{
public:
    virtual double operator()(const double& x) const = 0;
    virtual void operator()(double* x, const int n) const
    {
        for(int i = 0; i < n; i++)
            x[i] = this->operator()(x[i]);
    }
};
</code></pre>
<p>The first function evaluates one point at a time, and the second version
overwrites each point in the array with the corresponding function value.
Only the second function will be used by the integration code, but usually it
is easier to implement the first one.</p>
<p><code>RcppNumerical</code> provides a wrapper function for the <code>NumericalIntegration</code>
library with the following interface:</p>
<pre><code class="language-cpp">inline double integrate(
    const Func& f, const double& lower, const double& upper,
    double& err_est, int& err_code,
    const int subdiv = 100,
    const double& eps_abs = 1e-8, const double& eps_rel = 1e-6,
    const Integrator<double>::QuadratureRule rule = Integrator<double>::GaussKronrod41
)
</code></pre>
<p>See the <a href="https://github.com/yixuan/RcppNumerical">README</a> page for the
explanation of each argument. Below shows an example that calculates the
moment generating function of a $Beta(a,b)$ distribution,
$M(t) = E(e^{tX})$:</p>
<pre><code class="language-cpp">// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::depends(RcppNumerical)]]
#include <RcppNumerical.h>
using namespace Numer;

// M(t) = E(exp(t * X)) = int exp(t * x) * f(x) dx, f(x) is the p.d.f. of Beta(a, b)
class Mintegrand: public Func
{
private:
    const double a;
    const double b;
    const double t;
public:
    Mintegrand(double a_, double b_, double t_) : a(a_), b(b_), t(t_) {}

    double operator()(const double& x) const
    {
        return std::exp(t * x) * R::dbeta(x, a, b, 0);
    }
};

// [[Rcpp::export]]
double beta_mgf(double t, double a, double b)
{
    Mintegrand f(a, b, t);
    double err_est;
    int err_code;
    return integrate(f, 0.0, 1.0, err_est, err_code);
}
</code></pre>
<p>We can compile and run this code in R and draw the graph:</p>
<pre><code class="language-r">library(Rcpp)
library(ggplot2)
sourceCpp("somefile.cpp")
t0 = seq(-3, 3, by = 0.1)
mt = sapply(t0, beta_mgf, a = 1, b = 1)
qplot(t0, mt, geom = "line", xlab = "t", ylab = "M(t)",
      main = "Moment generating function of Beta(1, 1)")
</code></pre>
<div align="center">
<img src="https://i.imgur.com/2ZDJH7X.png" width="500px" />
</div>
<h1 id="numerical-optimization">Numerical Optimization</h1>
<p>Currently <code>RcppNumerical</code> contains the L-BFGS algorithm for unconstrained
minimization problems based on the
<a href="https://github.com/chokkan/liblbfgs">libLBFGS</a> library
developed by <a href="http://www.chokkan.org/">Naoaki Okazaki</a>.</p>
<div align="center">
<img src="https://aria42.com/images/steepest-descent.png" width="400px" />
<p>(Picture from <a href="https://aria42.com/blog/2014/12/understanding-lbfgs">aria42.com</a>)</p>
</div>
<p>Again, one needs to first define a functor to represent the multivariate
function to be minimized.</p>
<pre><code class="language-cpp">class MFuncGrad
{
public:
    virtual double f_grad(Constvec& x, Refvec grad) = 0;
};
</code></pre>
<p>Here <code>Constvec</code> represents a read-only vector and <code>Refvec</code> a writable
vector. Their definitions are</p>
<pre><code class="language-cpp">// Reference to a vector
typedef Eigen::Ref<Eigen::VectorXd> Refvec;
typedef const Eigen::Ref<const Eigen::VectorXd> Constvec;
</code></pre>
<p>(Basically you can treat <code>Refvec</code> as an <code>Eigen::VectorXd</code> and
<code>Constvec</code> as its <code>const</code> version. Using <code>Eigen::Ref</code> is mainly to avoid
memory copies. See the explanation
<a href="https://eigen.tuxfamily.org/dox/classEigen_1_1Ref.html">here</a>.)</p>
<p>The <code>f_grad()</code> member function returns the function value on vector <code>x</code>,
and overwrites <code>grad</code> by the gradient.</p>
<p>The wrapper function for libLBFGS is</p>
<pre><code class="language-cpp">inline int optim_lbfgs(
    MFuncGrad& f, Refvec x, double& fx_opt,
    const int maxit = 300,
    const double& eps_f = 1e-6, const double& eps_g = 1e-5
)
</code></pre>
<p>Also refer to the <a href="https://github.com/yixuan/RcppNumerical">README</a> page for
details and see the logistic regression example below.</p>
<h1 id="fast-logistic-regression-an-example">Fast Logistic Regression: An Example</h1>
<p>Let’s see a realistic example that uses the optimization library to fit a
logistic regression.</p>
<p>Given a data matrix $X$ and a 0-1 valued vector $Y$, we want to find a
coefficient vector $\beta$ such that the negative log-likelihood function is
minimized:</p>
<p>$$\min_{\beta} -l(\beta)=\sum_{i=1}^n\left[ \log(1+\exp(x_i^\prime\beta)) - y_i x_i^\prime\beta\right]$$</p>
<p>The gradient function is</p>
<p>$$g(\beta)=X^\prime(p(\beta)-Y),\quad p(\beta)=\frac{1}{1+\exp(-X\beta)}$$</p>
<p>So we can write the code as follows:</p>
<pre><code class="language-cpp">// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::depends(RcppNumerical)]]
#include <RcppNumerical.h>
using namespace Numer;
using Rcpp::NumericVector;
using Rcpp::NumericMatrix;

typedef Eigen::Map<Eigen::MatrixXd> MapMat;
typedef Eigen::Map<Eigen::VectorXd> MapVec;

class LogisticReg: public MFuncGrad
{
private:
    const MapMat X;
    const MapVec Y;
public:
    LogisticReg(const MapMat x_, const MapVec y_) : X(x_), Y(y_) {}

    double f_grad(Constvec& beta, Refvec grad)
    {
        // Negative log likelihood
        // sum(log(1 + exp(X * beta))) - y' * X * beta
        Eigen::VectorXd xbeta = X * beta;
        const double yxbeta = Y.dot(xbeta);
        // X * beta => exp(X * beta)
        xbeta = xbeta.array().exp();
        const double f = (xbeta.array() + 1.0).log().sum() - yxbeta;

        // Gradient
        // X' * (p - y), p = exp(X * beta) / (1 + exp(X * beta))
        // exp(X * beta) => p
        xbeta.array() /= (xbeta.array() + 1.0);
        grad.noalias() = X.transpose() * (xbeta - Y);

        return f;
    }
};

// [[Rcpp::export]]
NumericVector logistic_reg(NumericMatrix x, NumericVector y)
{
    const MapMat xx = Rcpp::as<MapMat>(x);
    const MapVec yy = Rcpp::as<MapVec>(y);
    // Negative log likelihood
    LogisticReg nll(xx, yy);
    // Initial guess
    Eigen::VectorXd beta(xx.cols());
    beta.setZero();

    double fopt;
    int status = optim_lbfgs(nll, beta, fopt);
    if(status < 0)
        Rcpp::stop("fail to converge");

    return Rcpp::wrap(beta);
}
</code></pre>
<p>Now let’s do a quick benchmark:</p>
<pre><code class="language-r">set.seed(123)
n = 5000
p = 100
x = matrix(rnorm(n * p), n)
beta = runif(p)
xb = c(x %*% beta)
p = exp(xb) / (1 + exp(xb))
y = rbinom(n, 1, p)
system.time(res1 <- glm.fit(x, y, family = binomial())$coefficients)
## user system elapsed
## 0.339 0.004 0.342
system.time(res2 <- logistic_reg(x, y))
## user system elapsed
## 0.01 0.00 0.01
max(abs(res1 - res2))
## [1] 1.977189e-07
</code></pre>
<p>This is not a fair comparison, however, since <code>glm.fit()</code> calculates some
other components besides $\beta$, and the precisions of the two methods are also
different.</p>
<p><code>RcppNumerical</code> provides a function <code>fastLR()</code> that is a more stable
version of the code above (avoiding <code>exp()</code> overflow) and returns similar
components as <code>glm.fit()</code>. The performance is similar:</p>
<pre><code class="language-r">system.time(res3 <- fastLR(x, y)$coefficients)
## user system elapsed
## 0.01 0.00 0.01
max(abs(res1 - res3))
## [1] 1.977189e-07
</code></pre>
<p>Its source code can be found
<a href="https://github.com/yixuan/RcppNumerical/blob/master/src/fastLR.cpp">here</a>.</p>
<h1 id="final-words">Final Words</h1>
<p>If you think this package may be helpful, feel free to leave comments or
request features in the <a href="https://github.com/yixuan/RcppNumerical/issues">Github</a>
page. Contribution and pull requests would be great.</p>
Large scale eigenvalue decomposition and SVD with rARPACK
https://statr.me/2016/02/large-scale-eigen-and-svd-with-rarpack/
Sat, 20 Feb 2016 00:00:00 +0000https://statr.me/2016/02/large-scale-eigen-and-svd-with-rarpack/
<blockquote>
<p>In January 2016, I was honored to receive an “Honorable Mention” of the
<a href="http://stat-computing.org/awards/jmc/">John Chambers Award 2016</a>.
This article was written for <a href="https://www.r-bloggers.com/">R-bloggers</a>,
whose builder, Tal Galili, kindly invited me
to write an introduction to the <code>rARPACK</code> package.</p>
</blockquote>
<h1 id="a-short-story-of-rarpack">A Short Story of rARPACK</h1>
<p>Eigenvalue decomposition is a commonly used technique in
numerous statistical problems. For example, principal component analysis (PCA)
basically conducts eigenvalue decomposition on the sample covariance of a data
matrix: the eigenvalues are the component variances, and eigenvectors are the
variable loadings.</p>
<p>In R, the standard way to compute eigenvalues is the <code>eigen()</code> function.
However, when the matrix becomes large, <code>eigen()</code> can be very time-consuming:
the complexity of calculating all eigenvalues of an $n \times n$ matrix is
$O(n^3)$.</p>
<p>In real applications, however, we usually only need to compute a few
eigenvalues or eigenvectors. For example, to visualize high-dimensional
data using PCA, we may only use the first two or three components to draw
a scatterplot. Unfortunately <code>eigen()</code> has no option to limit the
number of eigenvalues to be computed. This means that we always need to do the
full eigen decomposition, which can cause a huge waste in computation.</p>
<p>And this is why the <a href="https://cran.r-project.org/package=rARPACK"><code>rARPACK</code></a>
package was developed. As the name indicates,
<code>rARPACK</code> was originally an R wrapper of the
<a href="http://www.caam.rice.edu/software/ARPACK/">ARPACK</a> library, a FORTRAN package
that is used to calculate a few eigenvalues of a square matrix. However,
ARPACK has not been under development for a long time, and it has some compatibility
issues with the current version of LAPACK. Therefore, to maintain <code>rARPACK</code> in a
good state, I wrote a new backend for <code>rARPACK</code>, and that is the C++ library
<a href="https://spectralib.org/">Spectra</a>.</p>
<p>The name of <code>rARPACK</code> was POORLY designed, I admit. Starting from version
0.8-0, <code>rARPACK</code> no longer relies on ARPACK, but due to CRAN policies and
reverse dependencies, I have to keep using the old name.</p>
<h1 id="features-and-usage">Features and Usage</h1>
<p>The usage of <code>rARPACK</code> is simple. If you want to calculate some eigenvalues
of a square matrix <code>A</code>, just call the function <code>eigs()</code> and tell it how many
eigenvalues you want (argument <code>k</code>), and which eigenvalues to calculate
(argument <code>which</code>). By default, <code>which = "LM"</code> means to pick the eigenvalues
with the largest magnitude (modulus for complex numbers and absolute value
for real numbers). If the matrix is known to be symmetric, calling
<code>eigs_sym()</code> is preferred since it guarantees that the eigenvalues are real.</p>
<pre><code class="language-r">library(rARPACK)
set.seed(123)
## Some random data
x = matrix(rnorm(1000 * 100), 1000)
## If retvec == FALSE, we don't calculate eigenvectors
eigs_sym(cov(x), k = 5, which = "LM", opts = list(retvec = FALSE))
</code></pre>
<p>For really large data, the matrix is usually in sparse form. <code>rARPACK</code>
supports several sparse matrix types defined in the <code>Matrix</code>
package, and you can even pass an implicit matrix defined by a function to
<code>eigs()</code>. See <code>?rARPACK::eigs</code> for details.</p>
<pre><code class="language-r">library(Matrix)
spmat = as(cov(x), "dgCMatrix")
eigs_sym(spmat, 2)
## Implicitly define the matrix by a function that calculates A %*% x
## Below represents a diagonal matrix diag(c(1:10))
fmat = function(x, args)
{
    return(x * (1:10))
}
eigs_sym(fmat, 3, n = 10, args = NULL)
</code></pre>
<h1 id="from-eigenvalue-to-svd">From Eigenvalue to SVD</h1>
<p>An extension to eigenvalue decomposition is the singular value decomposition
(SVD), which works for general rectangular matrices. Still take PCA as
an example. To calculate variable loadings, we can perform an SVD on the
centered data matrix, and the loadings will be contained in the right singular
vectors. This method avoids computing the covariance matrix, and is generally
more stable and accurate than using <code>cov()</code> and <code>eigen()</code>.</p>
<p>Similar to <code>eigs()</code>, <code>rARPACK</code> provides the function <code>svds()</code> to conduct
a partial SVD, meaning that only some of the singular pairs (values and vectors)
are computed. The example below computes the first three PCs
of a 2000x500 matrix, and I compare the timings of three different approaches:</p>
<pre><code class="language-r">library(microbenchmark)
set.seed(123)
## Some random data
x = matrix(rnorm(2000 * 500), 2000)
pc = function(x, k)
{
    ## First center data
    xc = scale(x, center = TRUE, scale = FALSE)
    ## Partial SVD
    decomp = svds(xc, k, nu = 0, nv = k)
    return(list(loadings = decomp$v, scores = xc %*% decomp$v))
}
microbenchmark(princomp(x), prcomp(x), pc(x, 3), times = 5)
</code></pre>
<p>The <code>princomp()</code> and <code>prcomp()</code> functions are the standard approaches in R
to do PCA, which will call <code>eigen()</code> and <code>svd()</code> respectively.
On my machine (Fedora Linux 23, R 3.2.3 with optimized single-threaded
OpenBLAS), the timing results are as follows:</p>
<pre><code>Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
 princomp(x) 274.7621 276.1187 304.3067 288.7990 289.5324 392.3211     5
   prcomp(x) 306.4675 391.9723 408.9141 396.8029 397.3183 552.0093     5
    pc(x, 3) 162.2127 163.0465 188.3369 163.3839 186.1554 266.8859     5
</code></pre>
<h1 id="applications">Applications</h1>
<p>SVD has some interesting applications, and one of them is image compression.
The basic idea is to perform a partial SVD on the image matrix, and then recover
it using the calculated singular values and singular vectors.</p>
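<p>As a rough sketch of this idea (the matrix <code>img</code> below is simulated for
illustration; a real grayscale image could be read with, e.g., <code>png::readPNG()</code>):</p>
<pre><code class="language-r">library(rARPACK)

## Simulated stand-in for a 622x1000 grayscale image matrix
set.seed(123)
img = matrix(runif(622 * 1000), 622)

## Rank-k reconstruction from the first k singular pairs
compress = function(img, k)
{
    decomp = svds(img, k)
    decomp$u %*% diag(decomp$d, k, k) %*% t(decomp$v)
}

## With k = 5 the approximation is determined by
## 5 * (622 + 1000 + 1) = 8115 numbers
approx5 = compress(img, 5)
dim(approx5)  ## still 622x1000
</code></pre>
<p>Only <code>u</code>, <code>d</code> and <code>v</code> need to be stored, which is where the compression
comes from.</p>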
<p>Below is an image of size 622x1000:
<div align="center">
<img src="https://i.imgur.com/VfmfWJi.jpg" width="500px" />
<p>(Original image)</p>
</div></p>
<p>If we use the first five singular pairs to recover the image,
we only need to store 5 * (622 + 1000 + 1) = 8115 elements, which is merely 1.3% of
the original data size (622 * 1000 = 622,000 numbers). The recovered image looks like this:</p>
<div align="center">
<img src="https://i.imgur.com/U2dYWHb.jpg" width="500px" />
<p>(5 singular pairs)</p>
</div>
<p>Although the recovered image is quite blurred, it already reveals the main
structure of the original image. If we increase the number of singular pairs
to 50, the difference becomes almost imperceptible, as shown below.</p>
<div align="center">
<img src="https://i.imgur.com/rWSYG7B.jpg" width="500px" />
<p>(50 singular pairs)</p>
</div>
<p>There is also a nice <a href="https://yihui.shinyapps.io/imgsvd/">Shiny App</a>
developed by <a href="https://nanx.me/">Nan Xiao</a>,
<a href="https://yihui.name/">Yihui Xie</a> and <a href="http://www.sfu.ca/~hetongh/">Tong He</a> that
allows users to upload an image and visualize the effect of compression using
this algorithm. The code is available on
<a href="https://github.com/road2stat/imgsvd">GitHub</a>.</p>
<h1 id="performance">Performance</h1>
<p>Finally, I would like to use some benchmark results to show the
performance of <code>rARPACK</code>. As far as I know, very few packages
in R can do partial eigenvalue decomposition, so the results
here are based on partial SVD.</p>
<p>The first plot compares different SVD functions on a 1000x500 matrix,
with dense format on the left panel, and sparse format on the right.</p>
<div align="center">
<img src="https://i.imgur.com/6TyJruc.png" width="700px" />
</div>
<p>The second plot shows the results on a 5000x2500 matrix.</p>
<div align="center">
<img src="https://i.imgur.com/CTtZieD.png" width="700px" />
</div>
<p>The functions corresponding to the axis labels are as follows:</p>
<ul>
<li>svd: <code>svd()</code> from base R, which computes the full SVD</li>
<li>irlba: <code>irlba()</code> from <a href="https://cran.r-project.org/package=irlba"><code>irlba</code></a>
package, partial SVD</li>
<li>propack, trlan: <code>propack.svd()</code> and <code>trlan.svd()</code> from
<a href="https://cran.r-project.org/package=svd"><code>svd</code></a> package, partial SVD</li>
<li>svds: <code>svds()</code> from <code>rARPACK</code></li>
</ul>
<p>The benchmark code and the environment used to run it can be
found <a href="https://spectralib.org/performance.html">here</a>.</p>
An overview of linear algebra libraries in Scala/Java
https://statr.me/2015/09/an-overview-of-linear-algebra-libraries-in-scala-java/
Sat, 19 Sep 2015 00:00:00 +0000https://statr.me/2015/09/an-overview-of-linear-algebra-libraries-in-scala-java/
<p>This semester I’m taking a course in big data computing using Scala/Spark, and
we are asked to finish a course project related to big data analysis. Since
statistical modeling heavily relies on linear algebra, I investigated some
existing libraries in Scala/Java that deal with matrix and linear algebra
algorithms.</p>
<h1 id="1-set-up">1. Set-up</h1>
<p>Scala/Java libraries are usually distributed as <code>*.jar</code> files. To use them in Scala,
we can create a directory to hold them and set an environment variable so that
Scala knows about this path. For example, first create a folder named <code>scala_lib</code>
in the home directory, and then edit the <code>.bash_profile</code> file
(create one if it does not exist), adding the following line:</p>
<pre><code class="language-bash">export CLASSPATH=$CLASSPATH:~/scala_lib/*
</code></pre>
<p>To make it effective for the current session, type in the terminal</p>
<pre><code class="language-bash">source ~/.bash_profile
</code></pre>
<p>Then the <code>.jar</code> files can be downloaded to this directory, and Scala will recognize them.</p>
<h1 id="2-common-operations">2. Common operations</h1>
<p>Most of the libraries discussed here support basic matrix operations, such as
creating a matrix, getting and setting elements or sub-matrices, matrix multiplications,
solving linear equations, etc.</p>
<p>For each library discussed below, we use it to accomplish the following tasks, in
order to demonstrate the usage of the library:</p>
<ol>
<li>Create a 3 by 6 matrix $A$</li>
<li>Fill the matrix with random numbers</li>
<li>Set $A_{1,1}=A_{3,6}$</li>
<li>Get the 3 by 3 sub-matrix $B=A_{1:3,1:3}$</li>
<li>Set the sub-matrix $A_{1:3,2:4}=B$</li>
<li>Calculate the matrix product $C=A^\prime B$</li>
<li>Solve linear equation $Bx=a$, where $a$ is the first column of $A$</li>
</ol>
<h1 id="3-jama">3. JAMA</h1>
<h2 id="introduction">Introduction</h2>
<p><a href="http://math.nist.gov/javanumerics/jama/">JAMA</a> is a basic linear algebra package
for Java which provides a matrix class and a number of matrix decomposition classes.
The matrix class supports basic operations such as matrix addition, multiplication,
transpose, norm calculation etc.</p>
<h2 id="installation">Installation</h2>
<p>Simply download the <code>.jar</code> file into the Java class path.</p>
<pre><code class="language-bash">cd ~/scala_lib
wget http://math.nist.gov/javanumerics/jama/Jama-1.0.3.jar
</code></pre>
<h2 id="usage-example">Usage Example</h2>
<p>In Scala console,</p>
<pre><code class="language-scala">// Import library
import Jama._
// Create the matrix
val A = new Matrix(3, 6)
// Fill the matrix with random numbers
val r = new scala.util.Random(0)
for(i <- 0 until A.getRowDimension())
  for(j <- 0 until A.getColumnDimension())
    A.set(i, j, r.nextDouble())
// JAMA does not provide methods to print the matrix,
// but we can view the data using the following trick
A.getArray().foreach(row => println(row.mkString("\t")))
// Set the first value to be the last value
A.set(0, 0, A.get(A.getRowDimension()-1, A.getColumnDimension()-1))
// Get a sub-matrix, 1st row to 3rd row, 1st column to 3rd column
val B = A.getMatrix(0, 2, 0, 2)
// Set a sub-matrix of A to B, 1st row to 3rd row,
// 2nd column to 4th column
A.setMatrix(0, 2, 1, 3, B)
// Matrix product C=A'B
val C = A.transpose().times(B)
// Solve linear equation
val a = A.getMatrix(0, 2, 0, 0)
val x = B.solve(a)
</code></pre>
<h2 id="documention">Documentation</h2>
<p>The full documentation is at <a href="http://math.nist.gov/javanumerics/jama/doc/">http://math.nist.gov/javanumerics/jama/doc/</a>.</p>
<h1 id="4-apache-commons-math">4. Apache Commons Math</h1>
<h2 id="introduction-1">Introduction</h2>
<p><a href="http://commons.apache.org/proper/commons-math/index.html">Apache Commons Math</a>
is an Apache project aiming to address the most common mathematical and statistical
problems that are not available in the standard Java language. It supports both
dense and sparse matrix classes, equipped with basic operations as well as matrix
decomposition algorithms.</p>
<h2 id="installation-1">Installation</h2>
<p>Download a zip file and extract the <code>.jar</code> file into the Java class path.</p>
<pre><code class="language-bash">cd ~/scala_lib
wget http://supergsego.com/apache//commons/math/binaries/commons-math3-3.5-bin.tar.gz
tar xzf commons-math3-3.5-bin.tar.gz commons-math3-3.5/commons-math3-3.5.jar
mv commons-math3-3.5/commons-math3-3.5.jar .
rm commons-math3-3.5-bin.tar.gz
rm -r commons-math3-3.5
</code></pre>
<h2 id="usage-example-1">Usage Example</h2>
<p>In Scala console,</p>
<pre><code class="language-scala">// Import library
import org.apache.commons.math3.linear._
// Create the matrix
val A = new Array2DRowRealMatrix(3, 6)
// Fill the matrix with random numbers
val r = new scala.util.Random(0)
for(i <- 0 until A.getRowDimension())
  for(j <- 0 until A.getColumnDimension())
    A.setEntry(i, j, r.nextDouble())
// Set the first value to be the last value
A.setEntry(0, 0,
  A.getEntry(A.getRowDimension() - 1, A.getColumnDimension() - 1))
// Get a sub-matrix, 1st row to 3rd row, 1st column to 3rd column
val B = A.getSubMatrix(0, 2, 0, 2)
// Set a sub-matrix of A to B, 1st row to 3rd row,
// 2nd column to 4th column
A.setSubMatrix(B.getData(), 0, 1)
// Matrix product C=A'B
val C = A.transpose().multiply(B)
// Solve linear equation
val solver = new LUDecomposition(B).getSolver()
val a = A.getColumnVector(0)
val x = solver.solve(a)
</code></pre>
<h2 id="documention-1">Documentation</h2>
<p>The full documentation is at
<a href="http://commons.apache.org/proper/commons-math/userguide/linear.html">http://commons.apache.org/proper/commons-math/userguide/linear.html</a>.</p>
<h1 id="5-la4j">5. la4j</h1>
<h2 id="introduction-2">Introduction</h2>
<p><a href="http://la4j.org/">la4j</a> is a lightweight linear algebra library for Java,
supporting both dense and sparse matrices, as well as matrix decomposition
algorithms.</p>
<h2 id="installation-2">Installation</h2>
<p>Download the <code>.jar</code> file into the Java class path.</p>
<pre><code class="language-bash">cd ~/scala_lib
wget http://central.maven.org/maven2/org/la4j/la4j/0.5.5/la4j-0.5.5.jar
</code></pre>
<h2 id="usage-example-2">Usage Example</h2>
<p>In Scala console,</p>
<pre><code class="language-scala">// Import library
import org.la4j.matrix._
import org.la4j.linear._
// Create the matrix
val A = DenseMatrix.zero(3, 6)
// Fill the matrix with random numbers
val r = new scala.util.Random(0)
for(i <- 0 until A.rows())
  for(j <- 0 until A.columns())
    A.set(i, j, r.nextDouble())
// Set the first value to be the last value
A.set(0, 0, A.get(A.rows() - 1, A.columns() - 1))
// Get a sub-matrix, 1st row to 3rd row, 1st column to 3rd column
val B = A.slice(0, 0, 3, 3)
// Set a sub-matrix of A to B, 1st row to 3rd row,
// 2nd column to 4th column.
// It seems that there is no direct sub-matrix setting function,
// so we set columns one by one
for(i <- 0 to 2)
  A.setColumn(i + 1, B.getColumn(i))
// Matrix product C=A'B
val C = A.transpose().multiply(B)
// Solve linear equation
val solver = new GaussianSolver(B)
val a = A.getColumn(0)
val x = solver.solve(a)
</code></pre>
<h2 id="documention-2">Documentation</h2>
<p>The full documentation is at <a href="http://la4j.org/apidocs/">http://la4j.org/apidocs/</a>.</p>
<h1 id="6-ejml">6. EJML</h1>
<h2 id="introduction-3">Introduction</h2>
<p>Efficient Java Matrix Library (<a href="http://ejml.org/wiki/index.php">EJML</a>)
is a linear algebra library for manipulating dense matrices.
It is designed to be computationally efficient.</p>
<h2 id="installation-3">Installation</h2>
<p>Download a zip file and extract the <code>.jar</code> file into the Java class path.</p>
<pre><code class="language-bash">cd ~/scala_lib
wget http://downloads.sourceforge.net/project/ejml/v0.28/ejml-v0.28-libs.zip
unzip ejml-v0.28-libs.zip
mv ejml-v0.28-libs/* .
rm ejml-v0.28-libs.zip
rm -r ejml-v0.28-libs
</code></pre>
<h2 id="usage-example-3">Usage Example</h2>
<p>In Scala console,</p>
<pre><code class="language-scala">// Import library
import org.ejml.simple._
// Create the matrix
val A = new SimpleMatrix(3, 6)
// Fill the matrix with random numbers
val r = new scala.util.Random(0)
for(i <- 0 until A.numRows())
  for(j <- 0 until A.numCols())
    A.set(i, j, r.nextDouble())
// Set the first value to be the last value
A.set(0, 0, A.get(A.numRows() - 1, A.numCols() - 1))
// Get a sub-matrix, 1st row to 3rd row, 1st column to 3rd column
val B = A.extractMatrix(0, 3, 0, 3)
// Set a sub-matrix of A to B, 1st row to 3rd row,
// 2nd column to 4th column
// It seems that there is no direct sub-matrix setting function,
// so we set element by element
for(i <- 0 to 2)
  for(j <- 0 to 2)
    A.set(i, j + 1, B.get(i, j))
// Matrix product C=A'B
val C = A.transpose().mult(B)
// Solve linear equation
val a = A.extractVector(false, 0)
val x = B.solve(a)
</code></pre>
<h2 id="documention-3">Documentation</h2>
<p>The full documentation is at <a href="http://ejml.org/javadoc/">http://ejml.org/javadoc/</a>.</p>
<h1 id="7-breeze">7. Breeze</h1>
<h2 id="introduction-4">Introduction</h2>
<p><a href="https://github.com/scalanlp/breeze">Breeze</a> is a Scala library for
machine learning and numerical computing. It contains matrix and vector classes
and many other components such as statistical distributions, optimization,
integration, etc. Since this library is written in Scala, its syntax is generally
more elegant and convenient than that of the pure Java libraries.</p>
<h2 id="installation-4">Installation</h2>
<p>At the time of writing there is no officially provided <code>.jar</code> file on the web, so it is
suggested to build the library from source, which may take some effort.</p>
<p>The following commands automatically download the source code of Breeze and
a necessary build tool, <a href="http://www.scala-sbt.org/">sbt</a>, and then build the
<code>.jar</code> file from source.</p>
<pre><code class="language-bash">cd ~/scala_lib
wget https://github.com/scalanlp/breeze/archive/master.zip
unzip master.zip
cd breeze-master
wget https://dl.bintray.com/sbt/native-packages/sbt/0.13.9/sbt-0.13.9.tgz
tar xzf sbt-0.13.9.tgz
./sbt/bin/sbt assembly
mv target/scala-2.*/breeze-*.jar ..
cd ..
rm master.zip
rm -r breeze-master
</code></pre>
<h2 id="usage-example-4">Usage Example</h2>
<p>In Scala console,</p>
<pre><code class="language-scala">// Import library
import breeze.linalg._
import breeze.numerics._
// Create the matrix
val A = DenseMatrix.zeros[Double](3, 6)
// Fill the matrix with random numbers
val r = new scala.util.Random(0)
for(i <- 0 until A.rows)
  for(j <- 0 until A.cols)
    A(i, j) = r.nextDouble()
// Set the first value to be the last value
A(0, 0) = A(A.rows - 1, A.cols - 1)
// Get a sub-matrix, 1st row to 3rd row, 1st column to 3rd column
// We need to make a copy here since without it changing B will
// also change A
val B = A( :: , 0 to 2).copy
// Set a sub-matrix of A to B, 1st row to 3rd row,
// 2nd column to 4th column
A( :: , 1 to 3) := B
// Matrix product C=A'B
val C = A.t * B
// Solve linear equation
val a = A( :: , 0)
val x = B \ a
</code></pre>
<h2 id="documention-4">Documentation</h2>
<ul>
<li><a href="https://github.com/scalanlp/breeze/wiki/Quickstart">Quick Start</a></li>
<li><a href="https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet">Linear Algebra Cheat Sheet</a></li>
</ul>
<h1 id="8-summary">8. Summary</h1>
<p>All the libraries written in Java have similar syntax and function names. There
is also <a href="http://lessthanoptimal.github.io/Java-Matrix-Benchmark/runtime/2013_10_Corei7v2600/">a benchmark</a>
of matrix operations on different libraries, including the ones mentioned here.</p>
<p>Breeze takes advantage of the Scala syntax, which makes the code more elegant
and easier to write. For example, matrix elements can be accessed or set using
parentheses, and operator overloading allows users to write matrix operations
just like mathematical formulas. For this reason Breeze seems to be a
good choice for matrix manipulation in Scala.</p>
Using showtext in knitr
https://statr.me/2014/07/showtext-with-knitr/
Mon, 21 Jul 2014 00:00:00 +0000https://statr.me/2014/07/showtext-with-knitr/<p>Thanks to the <a href="https://github.com/yihui/knitr/issues/799">issue report</a> by
<a href="https://github.com/yufree">yufree</a> and Yihui’s
<a href="https://github.com/yihui/knitr">kind work</a>,
starting from version 1.6.10 (development version), <strong>knitr</strong> supports using
<a href="https://github.com/yixuan/showtext"><strong>showtext</strong></a>
to change fonts in R plots. To demonstrate its usage, this document
itself serves as an example. (<a href="https://github.com/yixuan/en/blob/gh-pages/files/showtext-knitr.Rmd">Rmd source code</a>)</p>
<p>We first do some setup work, mainly setting options that control
the appearance of the plots. Notice that if you create plots in PNG
format (the default format for HTML output), it is strongly recommended
to use the <code>CairoPNG</code> device rather than the default <code>png</code>, since
the latter can produce quite ugly plots when used with <strong>showtext</strong>.</p>
<pre><code>```{r setup}
knitr::opts_chunk$set(dev="CairoPNG", fig.width=7, fig.height=7, dpi = 72)
options(digits = 4)
```
</code></pre>
<p>Then we can load the <strong>showtext</strong> package and add fonts to it. Details about
font loading are explained in the
<a href="https://github.com/yixuan/showtext/blob/master/README.md">introduction document</a>
of <strong>showtext</strong> and also in the help topic <code>?sysfonts::font.add</code>.
While searching for and adding fonts may be tedious,
the <strong>sysfonts</strong> package (on which <strong>showtext</strong> depends)
provides a convenient function, <code>font.add.google()</code>, to automatically download
and use fonts from the Google Fonts repository
(<a href="https://www.google.com/fonts">https://www.google.com/fonts</a>).
The first parameter is the font name in Google Fonts, and the second is
the family name that will be used in R plots.</p>
<pre><code>```{r fonts, message=FALSE}
library(showtext)
font.add.google("Lobster", "lobster")
```
</code></pre>
<p>After adding fonts, simply set the <code>fig.showtext</code> chunk option in the code
chunk where you want to use <strong>showtext</strong>, and then specify the family name you
just added.</p>
<pre><code>```{r fig.showtext=TRUE, fig.align='center'}
plot(1, pch = 16, cex = 3)
text(1, 1.1, "A fancy dot", family = "lobster", col = "steelblue", cex = 3)
```
</code></pre>
<div align="center">
<img src="https://i.imgur.com/pO87LFy.png" />
</div>
Introduction to dynamic document and knitr
https://statr.me/2014/04/intro-to-knitr/
Thu, 03 Apr 2014 00:00:00 +0000https://statr.me/2014/04/intro-to-knitr/<p>Today I gave a presentation for the GSO (Graduate Student Organization) of our department,
mainly about the idea of dynamic documents and their implementation using knitr.</p>
<p><a href="http://archive.statr.me/files/GSO/GSO-knitr-new.html">Here</a> are the slides I showed in the talk,
written with Markdown and knitr.</p>
Using system fonts in R graphs
https://statr.me/2014/01/using-system-fonts-in-r-graphs/
Wed, 01 Jan 2014 00:00:00 +0000https://statr.me/2014/01/using-system-fonts-in-r-graphs/
<p>This is a pretty old topic in R graphics.
<a href="https://cran.r-project.org/doc/Rnews/Rnews_2006-2.pdf">A classical article in R NEWS</a>,
<em>Non-standard fonts in PostScript and PDF graphics</em>,
describes how to use and embed system fonts in the PDF/PostScript device.
More recently, <a href="https://github.com/wch">Winston Chang</a> developed
the <a href="https://github.com/wch/extrafont">extrafont</a> package, which
makes the procedure much easier. A useful introduction article can be found in the
<a href="https://github.com/wch/extrafont/blob/master/README.md">readme page</a> of <code>extrafont</code>,
and also from the <a href="http://blog.revolutionanalytics.com/2012/09/how-to-use-your-favorite-fonts-in-r-charts.html">Revolution blog</a>.</p>
<p>Now, we have another choice: the <code>showtext</code> package.</p>
<p><a href="https://github.com/yixuan/showtext/">showtext</a> 0.2 has just been
<a href="https://cran.r-project.org/web/packages/showtext/index.html">submitted to CRAN</a>.
Below is the introduction of this package excerpted from the
<a href="https://github.com/yixuan/showtext/blob/master/README.md">README.md</a> file. In short,</p>
<blockquote>
<p>We are now much freer to use system fonts in R to create figures.</p>
</blockquote>
<h2 id="what-s-this-package-all-about">What’s this package all about?</h2>
<p><code>showtext</code> is an R package to draw text in R graphs.</p>
<blockquote>
<p>Wait, R already has the <code>text()</code> function to do that…</p>
</blockquote>
<p>Yes, but drawing text is a very complicated task, and it always depends on
the specific <strong>graphics device</strong>.
(A graphics device is the engine that creates images.
For example, R provides the PDF device, invoked by the function <code>pdf()</code>,
to create graphs in PDF format.)
Sometimes the graphics device doesn’t support text drawing nicely,
<strong>especially when it comes to fonts</strong>.</p>
<p>From my own experience, I have always found it troublesome to create PDF
graphs with Chinese characters. This is because most of the standard
fonts used by <code>pdf()</code> don’t contain Chinese character glyphs, and,
even worse, users can hardly use the fonts that are already installed
in their operating system. (It seems still possible, though)</p>
<p><code>showtext</code> tries to do the following two things:</p>
<ul>
<li>Let R know about these system fonts</li>
<li>Use these fonts to draw text</li>
</ul>
<h2 id="why-pdf-doesn-t-work-and-how-showtext-works">Why <code>pdf()</code> doesn’t work and how <code>showtext</code> works</h2>
<p>Let me explain a little bit about how <code>pdf()</code> works.</p>
<p>To the best of my knowledge (I may be wrong, so please point it out if I make
mistakes), the default PDF device of R doesn’t “draw” the text,
but rather “describes” it in the PDF file.
That is to say, instead of drawing the lines and curves of the actual glyphs,
it only embeds information about the text, for example which characters
it contains, which font it uses, etc.</p>
<p>However, text with a declared font may be displayed differently on
different operating systems. The two images below are screenshots of the same PDF
file created by R, viewed under Windows and Linux respectively.</p>
<div align="center">
<img src="https://i.imgur.com/x1zM34F.png" />
</div>
<p>This means that the appearance of a graph created by <code>pdf()</code> is
system dependent. If you unfortunately don’t have the declared font
on your system, you may not be able to see the text correctly at all.</p>
<p>In comparison, the <code>showtext</code> package tries to solve this problem by
converting text into lines and curves, so that the graph has the same appearance
on all platforms. More importantly, <code>showtext</code> can use system font
files, so you can show your text in any font you want.
This solves the Chinese character problem I mentioned at the beginning,
because I can load my favorite Chinese font into R and use it to draw
text. Also, people who view the graph don’t need to install the font
that was used to create it. This is convenient for both graph makers
and graph viewers.</p>
<h2 id="the-usage">The Usage</h2>
<p>To create a graph using a specified font, you only need to do:</p>
<ul>
<li>(*) Load the font</li>
<li>Open the graphics device</li>
<li>(*) Declare that you want to use <code>showtext</code> to draw the text</li>
<li>Plot</li>
<li>Close the device</li>
</ul>
<p>Only the steps marked with (*) are newly added. Below is an example:</p>
<pre><code class="language-r">library(showtext)
font.add("fang", "simfang.ttf") ## add font
pdf("showtext-ex1.pdf")
plot(1, type = "n")
showtext.begin() ## turn on showtext
text(1, 1, intToUtf8(c(82, 35821, 35328)), cex = 10, family = "fang")
showtext.end() ## turn off showtext
dev.off()
</code></pre>
<div align="center">
<img src="https://i.imgur.com/u5uvjy5.png" />
</div>
<p>The use of <code>intToUtf8()</code> is for convenience if you can’t view or input
Chinese characters. You can instead use</p>
<pre><code class="language-r">text(1, 1, "R语言", cex = 10, family = "fang")
</code></pre>
<p>This example should work fine on Windows. On other operating systems, you may not have
the <code>simfang.ttf</code> font file, but there is no difficulty in using something
else. See the next section for details on how to load
a font with <code>showtext</code>.</p>
<h2 id="loading-font">Loading font</h2>
<p>Loading fonts is actually done by the <a href="https://github.com/yixuan/sysfonts/">sysfonts</a> package,
on which <code>showtext</code> depends.</p>
<p>The easiest way to load a font into R is to call <code>font.add(family, regular, ...)</code>,
where <code>family</code> is the name you give to that font (so that later you can
call <code>par(family = ...)</code> to use it in plotting), and <code>regular</code> is the
path to the font file. Usually font files are located in some “standard”
directories in the system (for example, on Windows this is typically C:/Windows/Fonts).
You can use <code>font.paths()</code> to check the current search path or add a new one,
and use <code>font.files()</code> to list the available font files in the search path.</p>
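<p>For example, to see where fonts are searched for and which font files are
available (the output depends on your system, and the extra directory below is
just a hypothetical path):</p>
<pre><code class="language-r">library(sysfonts)

## Directories currently searched for font files
font.paths()
## Font files found in the search path
font.files()
## Add a custom directory to the search path (hypothetical location)
font.paths("~/my_fonts")
</code></pre>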
<p>There are also many free fonts that can be downloaded from the web and then used by
<code>showtext</code>, as the following example shows:</p>
<pre><code class="language-r">library(showtext)
wd = setwd(tempdir())
download.file("http://fontpro.com/download-family.php?file=35701",
              "merienda-r.ttf", mode = "wb")
download.file("http://fontpro.com/download-family.php?file=35700",
              "merienda-b.ttf", mode = "wb")
font.add("merienda",
         regular = "merienda-r.ttf",
         bold = "merienda-b.ttf")
setwd(wd)
pdf("showtext-ex2.pdf", 7, 4)
plot(1, type = "n", xlab = "", ylab = "")
showtext.begin()
par(family = "merienda")
text(1, 1.2, "R can use this font!", cex = 2)
text(1, 0.8, "And in Bold font face!", font = 2, cex = 2)
showtext.end()
dev.off()
</code></pre>
<div align="center">
<img src="https://i.imgur.com/EUIGQ6L.png" />
</div>
<p>In this case we add two font faces (regular and bold) with the family name
“merienda”, and use <code>font = 2</code> to select the bold face (<code>font = 1</code>,
the regular face, is selected by default).</p>
<p>At present <code>font.add()</code> supports TrueType fonts (*.ttf/*.ttc) and
OpenType fonts (*.otf), but adding a new
font type is trivial as long as FreeType supports it.</p>
<p>Note that <code>showtext</code> includes an open source CJK font
<a href="http://wenq.org/wqy2/index.cgi?MicroHei%28en%29">WenQuanYi Micro Hei</a>.
If you just want to show CJK text in your graph, you don’t need to add any
extra font at all.</p>
<h2 id="known-issues">Known issues</h2>
<p>Images created by the bitmap graphics devices (<code>png()</code>, <code>jpeg()</code>, …)
may look ugly because these devices don’t support anti-aliasing well. To produce
high-quality output, try the <code>CairoPNG()</code> and <code>CairoJPEG()</code> devices from the
<a href="https://cran.r-project.org/web/packages/Cairo/index.html">Cairo</a> package.</p>
<h2 id="the-internals-of-showtext">The internals of <code>showtext</code></h2>
<p>Every graphics device in R implements functions to draw specific graphical
elements, e.g., <code>line()</code> to draw lines, <code>path()</code> and <code>polygon()</code> to draw polygons,
<code>text()</code> or <code>textUTF8()</code> to show text, etc. What <code>showtext</code> does is override
a device’s own text rendering functions, replacing them with hooks provided by <code>showtext</code>
that in turn call the device’s <code>path()</code>, <code>polygon()</code> or <code>line()</code> to draw the
character glyphs.</p>
<p>This happens only after you call <code>showtext.begin()</code>; calling <code>showtext.end()</code>
restores the original device functions, leaving the graphics device unmodified.</p>
A conversation with Hadley Wickham
https://statr.me/2013/09/a-conversation-with-hadley-wickham/
Fri, 27 Sep 2013 00:00:00 +0000https://statr.me/2013/09/a-conversation-with-hadley-wickham/<p><img src="https://i.imgur.com/EPgIMLi.jpg" class="align-right"/></p>
<blockquote>
<p>Dr. Hadley Wickham is the Chief Scientist of RStudio and Assistant Professor
of Statistics at Rice University. He is the developer of the famous R package <code>ggplot2</code>
for data visualization and the author of many other widely used packages like <code>plyr</code>
and <code>reshape2</code>. On Sep 13, 2013 he gave a talk at the Department of Statistics,
Purdue University, and afterwards I (Yixuan) had a conversation with him, talking
about his experience and interest in data visualization, data tidying, R programming
and other related topics.</p>
<p>Below is the written record of our conversation, with a Chinese translation posted in
<a href="https://cosx.org/">Capital of Statistics</a>, the largest online community on Statistics in China.</p>
</blockquote>
<p><strong>Yixuan:</strong> Can you first tell us, how did you choose to enter the field of
Statistics and data science?</p>
<p><strong>Hadley:</strong> I got my first degree from a medical school, and actually I was
halfway to becoming a doctor (laugh). But I realized I didn’t want to be a
doctor, so I went back to what I enjoyed in high school, which was computer
science and statistics. I really liked programming and statistics,
and then did my PhD in the United States, choosing data visualization
and multivariate data analysis as my thesis topic.</p>
<p><strong>Yixuan:</strong> How do you define data scientist compared to statistician?</p>
<p><strong>Hadley:</strong> I think they are basically the same. Data scientists may care more
about databases, more about programming, but basically they both
try to do the same thing.</p>
<p><strong>Yixuan:</strong> So it’s like that data scientists are more involved in data in practice?</p>
<p><strong>Hadley:</strong> Yeah. Traditionally statistics focuses much on the mathematical side.
A mathematical background is as important as a programming background.
To some extent it doesn’t matter if you know the right thing to do but
you can’t actually tell the computer how to do it. But equally
it doesn’t matter if you can tell a computer what to do but you don’t
know what you are doing. (laugh)</p>
<p><strong>Yixuan:</strong> Can you illustrate the most exciting thing and most challenging thing
in your work?</p>
<p><strong>Hadley:</strong> The thing I’m most excited about now is the next version of
<a href="http://ggplot2.org/"><code>ggplot2</code></a>,
which is called <a href="https://github.com/rstudio/ggvis"><code>ggvis</code></a>.
It goes around with graphics and interactivity.
I’m working on that with Winston Chang, who is also in RStudio.
Hopefully we can have something to show by the end of the year.</p>
<p><strong>Yixuan:</strong> So that’s a long term plan for the <code>ggplot2</code>?</p>
<p><strong>Hadley:</strong> Yeah. Basically it is pretty obvious that for data visualization now
you want to be doing them on the web, because everyone has a web
browser. And another thing is that the people who spend the most time
making graphics in general very fast across every platform are the
browser makers. There is a lot of competition between Chrome
and Firefox about who can be faster. And a lot of that now is
making it possible to do interactive graphics and statistical graphics
really quickly, much more than you could in the past. That is a really
important principle to make graphics and also to make it easy to add
interactivity to graphics. For example, you can add a slider that
automatically changes the bin of a histogram, or the span of a loess
smoother. So it’s pretty fun working on that.</p>
<p><strong>Yixuan:</strong> What about the most challenging thing?</p>
<p><strong>Hadley:</strong> Another thing I’m working on at the moment is
<a href="https://github.com/hadley/dplyr"><code>dplyr</code></a>, which is the next
iteration of <a href="http://plyr.had.co.nz/"><code>plyr</code></a>.
I have to learn a lot about how to write efficient
SQL. If you had asked me two weeks ago how much SQL I knew, maybe I
would say 75% of it. But now, after I’ve used it, I realize that I only
understand about 25% of it. It is much much richer and more complicated
than I realized. It’s both challenging and fun to learn.</p>
<p><strong>Yixuan:</strong> OK. So from my own point of view, you were previously most famous for
the <code>ggplot2</code> package. And now we see you putting more effort into
data tidying tools like <code>plyr</code> and <code>reshape2</code>, and you have also written some
tutorials about high-performance computing using <code>Rcpp</code>. So how are these
techniques related to each other? The data visualization, the computing,
and the data tidying?</p>
<p><strong>Hadley:</strong> What I’m interested in is how to make data analysis easy and fast. So
just look at how much time you spend doing each part of data analysis.
If you spend 8 hours doing data cleaning and tidying, but 2 hours
doing modeling, then obviously
you should first try to figure out how to make the tidying and cleaning
faster. Just like in my talk today, we may find that the two
bottlenecks are what you want to do, and how you tell the computer to
do that. A lot of my existing work, like <code>ggplot2</code>, <code>plyr</code> and <code>reshape2</code>, has
been more about making it easier to express what you want, not
about making the computer fast.</p>
<p>Now it’s easier to do all these sort of things, and the bottleneck is
actually doing the computation. Now I’m trying to learn how to write
fast code, how to write efficient R code, and how to connect to C++ to
achieve more speed. It’s a kind of cycle: if
the bottleneck is here, I go and fix it. Then that part takes less
time, the bottleneck shifts over there, and I work on that
problem.</p>
<p><strong>Yixuan:</strong> So you are trying to reduce both the time spent describing the data,
and the time spent on computation.</p>
<p><strong>Hadley:</strong> Right. And another thing I’m interested in is generally …
I know I cannot write every R package that people need, so how can I
make it easier for other people to write good R code, and to write R
packages?</p>
<p><strong>Yixuan:</strong> That’s the <a href="https://github.com/hadley/devtools"><code>devtools</code></a>?</p>
<p><strong>Hadley:</strong> Yeah, just making it easier for other people to use R and contribute.</p>
<p><strong>Yixuan:</strong> Can you introduce your toolbox for data analysis, the software
and languages you use?</p>
<p><strong>Hadley:</strong> I’m now pretty much an RStudio user. I used to use Sublime Text on the
Mac, but I’ve shifted over in the last couple of months. For now
RStudio is just easier for getting around functions. I spend 90%
of my time inside R. My job is analyzing data, and I’m also trying to
figure out like “what do I think people are trying to do”, “what are
people struggling with”, “how can I make it easier to express in R”,
and “how can I make the code more efficient”. So I still write mostly
in R, but if I discover a bottleneck, then I may write C++ to make it
faster. The challenge is that the code runs much, much faster,
but it takes much, much longer to write. If I make a mistake, it’s
likely to crash R, and you need to start again from scratch,
which is a little bit annoying. But at least now, if R crashes, RStudio will
just restart R and keep going.</p>
<p><strong>Yixuan:</strong> Many of the visitors of <a href="https://cosx.org/">our website</a>
are curious about the dynamic
graphics. They want to know whether you have any plan to integrate
for example R and the D3 library in the next generation of <code>ggplot2</code>.
Any plan or progress about that?</p>
<p><strong>Hadley:</strong> <code>ggvis</code> works by generating <a href="http://trifacta.github.io/vega/">Vega</a> code,
and Vega is a library built on
top of D3. So <code>ggvis</code> is essentially built on top of that, and also
supports dynamic and interactive graphics. I may show you a demo of
<code>ggvis</code>.</p>
<p>(Showing the demo)</p>
<p><strong>Yixuan:</strong> Another major change of software development we can see these years
is that social coding has become more popular. Many developers have a
<a href="https://github.com/">Github</a> account, for example. Do you think that will change the way we
develop R and related packages?</p>
<p><strong>Hadley:</strong> I think so. Certainly I find that the time between creating a Github
repository and my first pull request is getting smaller and smaller.
Recently I created a new repository, and I didn’t tell anyone about
that. After four hours there was a pull request. I think one of the
really nice things about social coding is that authors get motivated
because you can see other people not only using it but also caring
about it, which is really really cool. We’ve talked a little bit
about how we can make use of <a href="https://gist.github.com/">Gists</a>.
One example is <a href="https://rpubs.com/">RPubs</a>. That should be based
on Gist, so you can fork someone else’s work, add some
modifications, and if they want they can pull the changes back. We have a lot
of ideas around that.</p>
<p>Another thing I was doing lately was trying to figure out the best way
of reading a file, and R just gave one answer. I wrote an R Markdown
document providing three methods, and then I did a little bit of
benchmarking to see which one is faster. I tweeted about it, and people
forked and suggested other ways which were even faster. That’s
a really good way to learn, like “here is my best effort, can you do
any better?” Any time when there are two people collaborating to write
a piece of code, it is almost always better than just one person.</p>
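<p>To give a flavor of what such a comparison might look like (this is my own sketch, not Hadley’s actual document; the reader functions are invented for illustration), one can define several ways of reading a file, check that they agree, and time them:</p>

```r
# Hedged sketch: three ways to read a whole text file into one string.
# The file and the method names here are made up for illustration.
tmp <- tempfile(fileext = ".txt")
writeLines(rep("some example text", 1000L), tmp)

via_readlines <- function(f) paste(readLines(f), collapse = "\n")
via_scan      <- function(f) paste(scan(f, what = character(), sep = "\n",
                                        quiet = TRUE), collapse = "\n")
via_readchar  <- function(f) sub("\n$", "",
                                 readChar(f, file.info(f)$size, useBytes = TRUE))

# All three should agree on the content; timings will differ by method
identical(via_readlines(tmp), via_scan(tmp))
system.time(for (i in 1:20) via_readchar(tmp))
```

Benchmarking each function in a loop like this, and sharing the document for others to fork, is the workflow described above.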
<p><strong>Yixuan:</strong> And as the chief scientist in <a href="https://www.rstudio.com/">RStudio</a>,
do you have any future plan for RStudio?</p>
<p><strong>Hadley:</strong> I’m looking forward to the day when there are more scientists (laugh).
I think when we start making money, we will start investing back into
the R community. One thing that we would like to do is make
R as a language faster and more efficient.</p>
<p>I’m also really interested in statistical learning as a family of
modeling techniques, a kind of fitting them together very well and
forming a grammar of models. You can make a new model by joining things
together, just like the grammar of graphics, by which you can come up
with a new graphic that is just a new arrangement of existing components.
I think that’s something that makes it easier to learn modeling.
For example, instead of learning a linear model one way and a random forest
another way, you can learn them all in a unified framework.</p>
<p>Another thing I’d like to think about is Lasso-type methods. In one of my
classes, I want to show that now you should always try stabilized
regression, you should always try Lasso and similar methods.
I think there are 13 packages that do Lasso, and I tried them
all. But every single one of them broke for a different reason. For
example, it didn’t support missing values, it didn’t support
categorical variables, it didn’t do predictions and standard errors,
or it didn’t automatically find the lambda parameter. Maybe that’s
because the authors are more interested in the theoretical papers,
not in providing a tool that you can use in data analysis. So I want
to integrate them together to form a tool that is fast and works well.</p>
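<p>For readers curious what is under the hood of such tools: most modern Lasso solvers rely on coordinate descent with soft-thresholding. Below is a minimal illustrative sketch of that idea (my addition, not a tool Hadley mentions; real packages such as <code>glmnet</code> are far more careful and efficient, and handle exactly the practical issues listed above):</p>

```r
# Minimal coordinate-descent Lasso sketch (illustrative, not production).
# Objective: (1/2n) * ||y - X b||^2 + lambda * ||b||_1
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

lasso_cd <- function(X, y, lambda, n_iter = 500) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # partial residual leaving out predictor j
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      beta[j] <- soft_threshold(crossprod(X[, j], r_j) / n, lambda) /
        (crossprod(X[, j]) / n)
    }
  }
  beta
}

set.seed(1)
X <- matrix(rnorm(100 * 3), 100, 3)
y <- X %*% c(2, 0, -1) + rnorm(100)
lasso_cd(X, y, lambda = 0.1)  # coefficients are shrunk toward zero
```

With <code>lambda = 0</code> the update reduces to coordinate descent for ordinary least squares, which is one way to sanity-check such an implementation.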
<p><strong>Yixuan:</strong> For our team (<a href="https://cosx.org/">Capital of Statistics</a>),
we have translated your <em>ggplot2</em> book,
and a translation of Winston’s <em>R Graphics Cookbook</em> is also in progress. Can you introduce your next book,
which, if I’m correct, is <a href="http://adv-r.had.co.nz/">Advanced R Programming</a>?</p>
<p><strong>Hadley:</strong> The goal of <em>Advanced R Programming</em> is basically to help people
become better R programmers. Lots and lots of books are about how to
do statistics with R, but not many about programming in R. Matloff’s
<a href="http://nostarch.com/artofr.htm">Art of R Programming</a> is a
good basic-to-intermediate book,
and what I want to introduce are some of the features of the R
language that I think are really cool and powerful. To learn how to
use them you need to read a lot of documentation, and I do a lot of
experiments to figure out how things work. So I’m really interested in
helping people understand and write more efficient and also more
expressive code.</p>
<p>I think R has a reputation for being a horrible programming language,
but that’s not really true. I think the heart of R is really a
beautiful and elegant language. The majority of people using R are not
programmers, so alongside the really elegant core there is a lot of very
tedious R code. I think R is like JavaScript in this respect. There is a book called
<a href="http://shop.oreilly.com/product/9780596517748.do">JavaScript: The Good Parts</a>
that tries to pull out the good part.
The goal of my book is similar: not just telling people how to write
R in an elegant way, but also making it easier for them to solve
problems, by introducing a little more of the theory that underlies R.</p>
<p><strong>Yixuan:</strong> The last question: what are your hobbies when you are not working?</p>
<p><strong>Hadley:</strong> I like to cook. Recently I’ve been doing grilling, learning American
barbecue food. I also like to make cocktails.</p>
<p><strong>Yixuan:</strong> OK. Thank you for the conversation!</p>
<p><img src="https://i.imgur.com/ICvLmEQ.jpg" class="aligncenter"/></p>
<p>(Hadley Wickham with his <em>ggplot2</em> book in Chinese translation)</p>
Is Normal normal?
https://statr.me/2012/09/is-normal-normal/
Tue, 18 Sep 2012 00:00:00 +0000https://statr.me/2012/09/is-normal-normal/<p>Rumor has it that the Normal distribution is everything.</p>
<p>It will take a long long time to talk about the Normal distribution thoroughly.
However, today I will focus on a (seemingly) simple question, as is stated below:</p>
<blockquote>
<p>If $X$ and $Y$ are univariate Normal random variables, will $X+Y$ also be Normal?</p>
</blockquote>
<p>What’s your reaction towards this question? Well, at least for me, when I saw it I said
“Oh, it’s stupid. Absolutely it is Normal. And what’s more, any linear combination of
Normal random variables should be Normal.”</p>
<p>But I was wrong, and that’s why I wanted to write this post.</p>
<p>A counter-example is given by the book <em>Statistical Inference</em>
(George Casella and Roger L. Berger, 2nd Edition), in Exercise 4.47:</p>
<p>Let $X$ and $Y$ be independent $N(0,1)$ random variables, and define a new random variable $Z$ by</p>
<p>$$Z = \begin{cases} X &\text{if } XY > 0 \\ -X & \text{otherwise} \end{cases}$$</p>
<p>Then it can be shown that $Z$ has a Normal distribution, while $Y+Z$ does not.</p>
<p>Here I will not put any analytical proof, but use some descriptive graphs to show this. Below is the
R code to do the simulation.</p>
<pre><code class="language-r">set.seed(123);
x = rnorm(2000);
y = rnorm(2000);
z = ifelse(x * y > 0, x, -x);
par(mfrow = c(2, 1));
hist(y);
hist(z);
x11();
hist(y + z);
</code></pre>
<p>We obtain the random numbers of $X,Y$ and $Z$, and then use histograms to show their distributions.</p>
<p><img src="https://i.imgur.com/1zx8P.png" alt="Histograms of Y and Z" class="aligncenter"/></p>
<p><img src="https://i.imgur.com/6allQ.png" alt="Histograms of Y+Z" class="aligncenter"/></p>
<p>The result is clear: both $Y$ and $Z$ appear Normal, but $Y+Z$ has a bimodal distribution, which
is obviously non-Normal.</p>
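<p>The visual impression can be backed up with a formal test; for instance (my addition, not in the original post), the Shapiro–Wilk test overwhelmingly rejects normality for $Y+Z$:</p>

```r
# Re-running the simulation above, then applying the Shapiro-Wilk test
set.seed(123)
x <- rnorm(2000)
y <- rnorm(2000)
z <- ifelse(x * y > 0, x, -x)
shapiro.test(z)$p.value       # typically large: consistent with normality
shapiro.test(y + z)$p.value   # essentially zero: normality firmly rejected
```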
<p>So what’s wrong? We often hear that linear combinations of
Normal r.v.’s are also Normal, but an important condition is frequently omitted:
<strong>their joint distribution must be multivariate Normal</strong>. The formal proposition is stated below:</p>
<blockquote>
<p>If $X$ follows a multivariate Normal distribution, then any linear combination of the elements
of $X$ also follows a Normal distribution.</p>
</blockquote>
<p>In our example, we can prove that the joint distribution of $(Y,Z)$ is not bivariate Normal,
although the marginal distributions are Normal indeed.</p>
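<p>In fact there is a quick way to see this: by construction $YZ = |XY| \ge 0$, so $Y$ and $Z$ never take opposite signs, while for a non-degenerate bivariate Normal pair the event $\{Y > 0, Z < 0\}$ would have positive probability. A short empirical check (my addition):</p>

```r
# y*z = |x*y| >= 0 by construction, so y and z never have opposite signs,
# which is impossible for a non-degenerate bivariate Normal pair
set.seed(123)
x <- rnorm(2000)
y <- rnorm(2000)
z <- ifelse(x * y > 0, x, -x)
all(y * z >= 0)  # TRUE
cor(y, z)        # around 2/pi, strictly between -1 and 1
```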
<p>Then you may wonder how to construct more examples like this, that is, examples where $Y$ and $Z$ are both $N(0,1)$
random variables, but $(Y,Z)$ is not bivariate Normal. This is an interesting question,
and in fact, it’s much related to the <strong>Copula</strong> model. Here I only give some specific examples,
while the details about Copula model may be provided in future posts.</p>
<p>Consider functions</p>
<p>$$C_1(u,v)=[\max(u^{-2}+v^{-2}-1,0)]^{-1 / 2}$$</p>
<p>$$C_2(u,v)=\exp(-[(\ln u)^2+(\ln v)^2]^{1 / 2})$$</p>
<p>$$C_3(u,v)=-\ln\left(1+\frac{(e^{-u}-1)(e^{-v}-1)}{e^{-1}-1}\right)$$</p>
<p>and use $\Phi(y)$ to denote the c.d.f. of $N(0,1)$ distribution,
then $C_1(\Phi(y),\Phi(z))$, $C_2(\Phi(y),\Phi(z))$ and $C_3(\Phi(y),\Phi(z))$ are all
joint distribution functions that satisfy 1) not bivariate Normal and 2) marginal distributions
are $N(0,1)$.</p>
<p>Seems good, right?</p>
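<p>As a quick sanity check (my addition), each of these functions satisfies the copula boundary condition $C(u,1)=u$, which is exactly what makes the marginals of $C(\Phi(y),\Phi(z))$ standard Normal. If I identify them correctly, they correspond to the Clayton, Gumbel, and Frank copula families at particular parameter values:</p>

```r
# Verify the copula boundary condition C(u, 1) = u for the three examples
C1 <- function(u, v) pmax(u^(-2) + v^(-2) - 1, 0)^(-1/2)
C2 <- function(u, v) exp(-sqrt(log(u)^2 + log(v)^2))
C3 <- function(u, v) -log(1 + (exp(-u) - 1) * (exp(-v) - 1) / (exp(-1) - 1))

u <- c(0.2, 0.5, 0.9)
max(abs(C1(u, 1) - u))  # zero up to rounding
max(abs(C2(u, 1) - u))
max(abs(C3(u, 1) - u))
```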
Handwriting recognition using R
https://statr.me/2011/12/handwriting-recognition-using-r/
Sun, 18 Dec 2011 00:00:00 +0000https://statr.me/2011/12/handwriting-recognition-using-r/<p>This title is a bit of an exaggeration, since handwriting recognition is an advanced topic
in machine learning involving complex techniques and algorithms. In this blog I’ll
show you a simple demo illustrating how to recognize a single number (0 ~ 9) using R.
The overall process is that, you draw a number in a graphics device in R using your mouse,
and then the program will “guess” what you have input. It is just for <strong>FUN</strong>.</p>
<p>There are two major problems in this number recognition task:
how to describe the trace of your handwriting, and how to classify
that trace into the given classes (0 ~ 9).</p>
<p>For the first question, we could first detect the motion of your mouse
in the graphics device, and then record the coordinates of your mouse
cursor at a sequence of time points. This could be done via the
<code>getGraphicsEvent()</code> function in the <strong>grDevices</strong> package. For example, after I
drew a number 2 in the graphics window like below, the coordinates of
each point in the trace were assigned to a pair of variables <code>px</code> and <code>py</code>.</p>
<p><img src="https://i.imgur.com/257Ng.png" alt="Record trace" class="aligncenter"/></p>
<p>The scatterplot of <code>px</code> and <code>py</code> versus their orders in the trace is
shown below.</p>
<p><img src="https://i.imgur.com/4gsCV.png" alt="Record points" class="aligncenter"/></p>
<p>To be comparable among different traces, we normalize the Order to be
within (0, 1] (that is, transform 1, 2, …, n to 1/n, 2/n, …, 1).
Also, since this recording is discrete but the real trace should be
continuous, we use the <code>spline()</code> function to interpolate at unknown
points, resulting in the following figure.</p>
<p><img src="https://i.imgur.com/M0Wos.png" alt="Record splines" class="aligncenter"/></p>
<p>The dots in the figure have normalized orders of 0.02, 0.04,
0.06, …, 1, at which the x and y coordinates are obtained by
interpolation. Therefore, we could use $r = (x, y)$ where
$x = (x_1, x_2, \ldots, x_{50})^\prime$ and $y = (y_1, y_2, \ldots, y_{50})^\prime$ to
represent the information of the number 2 I have drawn. Somewhat
confused by the operations above? Well, the idea behind this
normalization and interpolation is simple: use 50 “uniformly
ordered” points (I call them “recording points”) to represent the trace.</p>
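<p>The normalization and interpolation can be sketched in a few lines of base R (the trace coordinates below are made up for illustration):</p>

```r
# Sketch of the recording-point construction: normalize the order of a
# (made-up) trace to (0, 1], then spline-interpolate at 50 uniform points.
set.seed(42)
px <- cumsum(runif(23))           # hypothetical x coordinates of a trace
py <- sin(seq_along(px) / 3)      # hypothetical y coordinates
s  <- seq_along(px) / length(px)  # normalized order: 1/n, 2/n, ..., 1
grid50 <- seq(1/50, 1, length.out = 50)
rx <- spline(s, px, xout = grid50)$y
ry <- spline(s, py, xout = grid50)$y
length(rx)  # 50 recording points
```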
<p>Now we come to the second question: given a trace, how do we classify
it? Obviously we first need a training set, the recording points of
number 0 to number 9 generated as above. Then we’ll compare the
given trace with each one in the training set and find out which
number resembles it most.</p>
<p>Several criteria could be used to measure the similarity, but some
important rules should be considered. We still use $r = (x, y)$ to
represent the recording points of a trace, and use $Sim(r_1, r_2)$ to
stand for the similarity between two traces. Notice that this
similarity should not be sensitive to the scale and location of
traces. That is, if I draw a number in another location in the
window, or in a larger or smaller size, the recognition should not be
influenced. In mathematics, this could be expressed by</p>
<p>$$Sim(r_1, r_2) = Sim(k_1 r_1 + b_1, k_2 r_2 + b_2)$$</p>
<p>where $k_1 > 0$, $k_2 > 0$, $b_1$, $b_2$ are real numbers.</p>
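<p>The Pearson correlation has exactly this invariance under positive scaling and shifting, which can be checked directly (a quick illustration of the property, not part of the original program):</p>

```r
# Pearson correlation is invariant to positive scaling and shifts
set.seed(1)
a <- rnorm(100)
b <- rnorm(100)
cor(a, b)
cor(3 * a + 5, 0.1 * b - 2)  # same value up to rounding
```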
<p>In my code, I simply define the similarity as the sum of Pearson
correlation coefficients of x and y, that is,</p>
<p>$$Sim(r_1, r_2) = Corr(r_1.x, r_2.x) + Corr(r_1.y, r_2.y)$$</p>
<p>The whole source code is (note that I use 500 recording points
instead of 50):</p>
<pre><code class="language-r">library(grid);
getData = function()
{
if(.Platform$OS.type == 'windows') x11() else x11(type = 'Xlib');
pushViewport(viewport());
grid.rect();
px = NULL;
py = NULL;
mousedown = function(buttons, x, y)
{
if(length(buttons) > 1 || identical(buttons, 2L))
return(invisible(1));
eventEnv$onMouseMove = mousemove;
NULL
}
mousemove = function(buttons, x, y)
{
px <<- c(px, x);
py <<- c(py, y);
grid.points(x, y);
NULL
}
mouseup = function(buttons, x, y) {
eventEnv$onMouseMove = NULL;
NULL
}
setGraphicsEventHandlers(onMouseDown = mousedown,
onMouseUp = mouseup);
eventEnv = getGraphicsEventEnv();
cat("Click down left mouse button and drag to draw the number,
right click to finish.\n");
getGraphicsEvent();
dev.off();
s = seq(0, 1, length.out = length(px));
spx = spline(s, px, n = 500)$y;
spy = spline(s, py, n = 500)$y;
return(cbind(spx, spy));
}
traceCorr = function(dat1, dat2)
{
cor(dat1[, 1], dat2[, 1]) + cor(dat1[, 2], dat2[, 2]);
}
# Please set the proper path of this file.
load("train.RData");
guess = function(verbose = FALSE)
{
test = getData();
coefs = sapply(recogTrain, traceCorr, dat2 = test);
num = which.max(coefs);
if(num == 10) num = 0;
if(verbose) print(coefs);
cat("I guess what you have input is ", num, ".\n", sep = "");
}
guess();
</code></pre>
<p>To run the code, you must load the “training set”, the file
<code>train.RData</code>, into R using the <code>load()</code> function, and then call
<code>guess()</code> to play with it.</p>
<p>Have fun!</p>
<p><a href="https://github.com/downloads/yixuan/en/Handwriting_recognition.zip">Download: Source code and training dataset</a></p>
Windows binary of RMySQL
https://statr.me/2011/10/windows-binary-of-rmysql/
Sat, 22 Oct 2011 00:00:00 +0000https://statr.me/2011/10/windows-binary-of-rmysql/<p>This binary package supports R 2.13.x (32-bit/64-bit) and MySQL 5.5.16 (32-bit/64-bit).</p>
<p><a href="https://github.com/downloads/yixuan/en/RMySQL_0.8-0.zip">RMySQL 0.8-0 for MySQL 5.5.16</a></p>
How to run regression on large datasets in R
https://statr.me/2011/10/large-regression/
Sun, 02 Oct 2011 00:00:00 +0000https://statr.me/2011/10/large-regression/<p>It’s well known that R is memory-based software, meaning that datasets must be copied into memory before
being manipulated. For small or medium scale datasets, this doesn’t cause any troubles.
However, when you need to deal with larger ones, for instance, financial time series or log data
from the Internet, the consumption of memory is always a nuisance.</p>
<p>Just to give a simple illustration, you can run the following code in R to allocate a matrix
named <code>x</code> and a vector named <code>y</code>.</p>
<pre><code class="language-r">set.seed(123);
n = 5000000;
p = 5;
x = matrix(rnorm(n * p), n, p);
x = cbind(1, x);
bet = c(2, rep(1, p));
y = c(x %*% bet) + rnorm(n);
</code></pre>
<p>If I try to run a regression on x and y with the built-in function <code>lm()</code>, I get an error.</p>
<pre><code class="language-r">> lm(y ~ 0 + x);
Error: cannot allocate vector of size 19.1 Mb
In addition: Warning messages:
1: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
Reached total allocation of 1956Mb: see help(memory.size)
2: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
Reached total allocation of 1956Mb: see help(memory.size)
3: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
Reached total allocation of 1956Mb: see help(memory.size)
4: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
Reached total allocation of 1956Mb: see help(memory.size)
</code></pre>
<p>The parameters of my machine are:</p>
<ul>
<li>CPU: Intel Core i5-2410M @ 2.30 GHz</li>
<li>Memory: 2GB</li>
<li>OS: Windows 7 64-bit</li>
<li>R: 2.13.1 32-bit</li>
</ul>
<p>In R, each numeric value occupies 8 bytes, so we can estimate that x and y together will only occupy
<code>5000000 * 7 * 8 / 1024 ^ 2 ≈ 267 MB</code>, far less than the total memory size of 2 GB.
However, memory is still exhausted, since <code>lm()</code> computes many objects besides
x and y, for example the fitted values and residuals.</p>
<p>If we are only interested in the coefficient estimation, we can directly use matrix operations
to compute beta hat:</p>
<pre><code class="language-r">beta.hat = solve(t(x) %*% x, t(x) %*% y);
</code></pre>
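<p>A small aside (my addition, not from the original post): base R’s <code>crossprod()</code> computes $X^\prime X$ and $X^\prime y$ without forming the transpose explicitly, which is usually faster and lighter on memory than <code>t(x) %*% x</code>:</p>

```r
# Solving via an explicit transpose vs. crossprod(); results agree to rounding
set.seed(42)
x <- cbind(1, matrix(rnorm(1000 * 5), 1000, 5))
y <- rnorm(1000)
b1 <- solve(t(x) %*% x, t(x) %*% y)
b2 <- solve(crossprod(x), crossprod(x, y))
max(abs(b1 - b2))  # tiny
```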
<p>This runs successfully on my machine, and it is very fast, taking only about 0.6 seconds
(I use an optimized Rblas.dll, <a href="http://yixuan.cos.name/en/wp-content/uploads/2011/10/Rblas_gotoblas.tar.gz">download here</a>). Nevertheless, if the sample size grows,
the matrix operation may also become infeasible. To give an estimate,
when the sample size is as large as <code>2 GB / (7 * 8 bytes) ≈ 38347922</code>,
x and y themselves will swallow all the memory, let alone the temporary variables created
in the computation.</p>
<p>So how can we cope with this problem?</p>
<p>One approach to avoiding excessive memory consumption is to use a database system and execute SQL
statements on it. A database stores data on the hard disk and uses a small buffer to run SQL,
so you don’t need to worry about memory;
it’s just a matter of how long it takes to accomplish the computation.</p>
<p>R supports many database systems among which <a href="https://www.sqlite.org/">SQLite</a> is the lightest and the most convenient.
There is an <strong>RSQLite</strong> package in R that allows you to read/write data from/to an <strong>SQLite</strong> database,
as well as execute SQL statements on it and fetch results back to R. Therefore, if we can
“translate” our algorithm into SQL statements, then the size of data we can deal with will only depend on
the hard disk size and the execution time we can tolerate.</p>
<p>To continue with the example above, I’ll illustrate how to complete the regression using a database and SQL.
First, we shall write the data into a database file on the hard disk. The code is:</p>
<pre><code class="language-r">gc();
dat = as.data.frame(x);
rm(x);
gc();
dat$y = y;
rm(y);
gc();
colnames(dat) = c(paste("x", 0:p, sep = ""), "y");
gc();
# Will also load the DBI package
library(RSQLite);
# Using the SQLite database driver
m = dbDriver("SQLite");
# The name of the database file
dbfile = "regression.db";
# Create a connection to the database
con = dbConnect(m, dbname = dbfile);
# Write the data in R into database
if(dbExistsTable(con, "regdata")) dbRemoveTable(con, "regdata");
dbWriteTable(con, "regdata", dat, row.names = FALSE);
# Close the connection
dbDisconnect(con);
# Garbage collection
rm(dat);
gc();
</code></pre>
<p>I use a lot of <code>rm()</code> and <code>gc()</code> calls to remove unused temporary variables and free memory.
When all is done, you’ll find a regression.db file in your working directory whose size is about 320 MB.
Then comes the most important step: translating the regression algorithm into SQL.</p>
<p>Recall that the expression of beta hat can be written in a quite simple form:</p>
<p>$$\hat{\beta}=(X^\prime X)^{-1}X^\prime y$$</p>
<p>Also note that however large the sample size $n$ is, $X^\prime X$ and $X^\prime y$ are always of the size
$(p+1) \times (p+1)$. If the number of variables is not very large, the inverse and multiplication
of matrices of that size could be easily handled by R, so our main target is to compute
$X^\prime X$ and $X^\prime y$ in SQL.</p>
<p>Rewrite $X$ as $X=(\mathbf{x_0,x_1,\ldots,x_p})$, then $X^\prime X$ could be expressed as</p>
<p>$$\left(\begin{array}{cccc} \mathbf{x_0^\prime x_0} & \mathbf{x_0^\prime x_1} & \ldots & \mathbf{x_0^\prime x_p} \\ \mathbf{x_1^\prime x_0} & \mathbf{x_1^\prime x_1} & \ldots & \mathbf{x_1^\prime x_p} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{x_p^\prime x_0} & \mathbf{x_p^\prime x_1} & \ldots & \mathbf{x_p^\prime x_p} \end{array}\right)$$</p>
<p>And each element in the matrix can be calculated using SQL, for example,</p>
<pre><code class="language-sql">select sum(x0 * x0), sum(x0 * x1) from regdata;
</code></pre>
<p>We can then use R to generate the SQL statements as character strings and send them to SQLite.
The code is as follows:</p>
<pre><code class="language-r">m = dbDriver("SQLite");
dbfile = "regression.db";
con = dbConnect(m, dbname = dbfile);
# Get variable names
vars = dbListFields(con, "regdata");
xnames = vars[-length(vars)];
yname = vars[length(vars)];
# Generate SQL statements to compute X'X
mult = outer(xnames, xnames, paste, sep = "*");
lower.index = lower.tri(mult, TRUE);
mult.lower = mult[lower.index];
sql = paste("sum(", mult.lower, ")", sep = "", collapse = ",");
sql = sprintf("select %s from regdata", sql);
txx.lower = unlist(dbGetQuery(con, sql), use.names = FALSE);
txx = matrix(0, p + 1, p + 1);
txx[lower.index] = txx.lower;
txx = t(txx);
txx[lower.index] = txx.lower;
# Generate SQL statements to compute X'Y
sql = paste(xnames, yname, sep = "*");
sql = paste("sum(", sql, ")", sep = "", collapse = ",");
sql = sprintf("select %s from regdata", sql);
txy = unlist(dbGetQuery(con, sql), use.names = FALSE);
txy = matrix(txy, p + 1);
# Compute beta hat in R
beta.hat.DB = solve(txx, txy);
</code></pre>
<p>We can check whether the results are the same by calculating the maximum absolute difference:</p>
<pre><code class="language-r">> max(abs(beta.hat - beta.hat.DB));
[1] 3.028688e-13
</code></pre>
<p>The difference is at the level of rounding error.</p>
<p>The computation takes about 17 seconds, far longer than the matrix operation, but it consumes almost
no extra memory, which is a typical example of “trading time for space”.
Furthermore, you may have noticed that the computation of
<code>sum(x0*x0), sum(x0*x1), ..., sum(x5*x5)</code> can be parallelized by opening several connections to the database
simultaneously, so if you have a multi-processor server, you may drastically reduce the time
after some arrangement.</p>
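<p>Incidentally, the same “accumulate $X^\prime X$ and $X^\prime y$ in pieces” idea also works in plain R without a database, by streaming the data in chunks so that only one chunk is ever in memory at a time. A sketch (with simulated chunks standing in for reads from disk):</p>

```r
# Chunked accumulation of X'X and X'y: each "chunk" would normally be read
# from disk; here it is simulated with the same model as in the post.
set.seed(123)
p <- 5
txx <- matrix(0, p + 1, p + 1)
txy <- numeric(p + 1)
for (chunk in 1:10) {
  xc <- cbind(1, matrix(rnorm(1000 * p), 1000, p))
  yc <- c(xc %*% c(2, rep(1, p))) + rnorm(1000)
  txx <- txx + crossprod(xc)       # accumulate X'X
  txy <- txy + crossprod(xc, yc)   # accumulate X'y
}
beta.hat <- solve(txx, txy)
round(beta.hat, 2)  # close to c(2, 1, 1, 1, 1, 1)
```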
<p>The whole source code can be <a href="https://github.com/downloads/yixuan/en/DB_regression.tar.gz">downloaded here</a>.</p>