recosystem: recommender system using parallel matrix factorization
A Quick View of Recommender System
The main task of recommender system is to predict unknown entries in the rating matrix based on observed values, as is shown in the table below:
Each cell with number in it is the rating given by some user on a specific item, while those marked with question marks are unknown ratings that need to be predicted. In some other literatures, this problem may be named collaborative filtering, matrix completion, matrix recovery, etc.
A popular technique to solve the recommender system problem is the matrix factorization method. The idea is to approximate the whole rating matrix $R_{m\times n}$ by the product of two matrices of lower dimensions, $P_{k\times m}$ and $Q_{k\times n}$, such that
$$R\approx P^\prime Q$$
Let $p_u$ be the $u$-th column of $P$, and $q_v$ be the $v$-th column of $Q$, then the rating given by user $u$ on item $v$ would be predicted as $p^\prime_u q_v$.
A typical solution for $P$ and $Q$ is given by the following optimization problem [1; 2]:
$$\min_{P,Q} \sum_{(u,v)\in R} \left[f(p_u,q_v;r_{u,v})+\mu_P||p_u||_1+\mu_Q||q_v||_1+\frac{\lambda_P}{2} ||p_u||_2^2+\frac{\lambda_Q}{2} ||q_v||_2^2\right]$$
where $(u,v)$ are locations of observed entries in $R$, $r_{u,v}$ is the observed rating, $f$ is the loss function, and $\mu_P,\mu_Q,\lambda_P,\lambda_Q$ are penalty parameters to avoid overfitting.
The process of solving the matrices $P$ and $Q$ is called model training, and the selection of penalty parameters is parameter tuning. After obtaining $P$ and $Q$, we can then do the prediction of $\hat{R}_{u,v}=p^\prime_u q_v$.
LIBMF and recosystem
LIBMF is an open source C++ library for recommender system using parallel matrix factorization, developed by Dr. Chih-Jen Lin and his research group. [3]
LIBMF is a parallelized library, meaning that users can take advantage of multi-core CPUs to speed up the computation. It also utilizes some advanced CPU features to further improve the performance.
recosystem
(Github) is an R wrapper of
the LIBMF library that inherits most of its features. Additionally, this
package provides a number of user-friendly R functions to
simplify data processing and model building. Also, unlike most other R packages
for statistical modeling that store the whole dataset and model object in
memory, LIBMF (and hence recosystem
) can significantly reduce memory use,
for instance the constructed model that contains information for prediction
can be stored in the hard disk, and output result can also be directly
written into a file rather than be kept in memory.
Overview of recosystem
The usage of recosystem
is quite simple, mainly consisting of the following steps:
- Create a model object (a Reference Class object in R) by calling
Reco()
. - Specify the data source, either from a data file or from R objects in memory.
- Train the model by calling the
$train()
method. A number of parameters can be set inside the function. - (Optionally) Call the
$tune()
method to select best tuning parameters along a set of candidate values, in order to achieve better model performance. - (Optionally) Export the model via
$output()
, i.e. write the factorization matrices $P$ and $Q$ into files or return them as R objects. - Use the
$predict()
method to compute predicted values.
More details are covered in the package
vignette
and the help pages ?recosystem::Reco
, ?recosystem::data_source
, ?recosystem::train
,
?recosystem::tune
, ?recosystem::output
, and ?recosystem::predict
.
In the next section we will demonstrate how to use recosystem
to analyze a
real movie recommendation data set.
MovieLens Data
The MovieLens website collected many movie rating data for research use. [4] In this article we download the MovieLens 1M Dataset from grouplens, which contains 1 million ratings from 6000 users and 4000 movies.
The rating data file, ratings.dat
, looks like below:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
...
Each line has the format UserID::MovieID::Rating::Timestamp
,
for example the first line says that User #1 gave Movie #1193 a rating of
5 at certain time point.
In recosystem
, we will not use the time information, and the required data
format is UserID MovieID Rating
, i.e., the columns are space-separated, and
columns after Rating
will be ignored.
Therefore, we first transform the data file into a format that is supported by
recosystem
. On Unix-like OS’s, we can use the sed
command to replace ::
by a space:
sed -e 's/::/ /g' ratings.dat > ratings2.dat
Then we can start to train a recommender, as the following code shows:
library(recosystem) # 1
r = Reco() # 2
train_set = data_file("ratings2.dat", index1 = TRUE) # 3
r$train(train_set, opts = list(dim = 20, # 4
costp_l1 = 0, costp_l2 = 0.01, # 5
costq_l1 = 0, costq_l2 = 0.01, # 6
niter = 10, # 7
nthread = 4)) # 8
In the code above, line 2 creates a model object such that the training
function $train()
can be called from it. Line 3 specifies the data
source – a data file on hard disk. Since in our data user ID and movie ID
start from 1 rather than 0, we use the index1 = TRUE
options in the function.
The data can also be read from memory, if the UserID, MovieID and Rating columns are stored as R vectors. Below shows an alternative way to provide the training set.
dat = read.table("ratings2.dat", sep = " ", header = FALSE,
colClasses = c(rep("integer", 3), "NULL"))
train_set = data_memory(user_index = dat[, 1],
item_index = dat[, 2],
rating = dat[, 3], index1 = TRUE)
Line 4 to line 6 set the relevant model parameters: $k, \mu_P,\mu_Q,\lambda_P$,
and $\lambda_Q$, and Line 7 gives the number of iterations. Finally as I have
mentioned previously, LIBMF is a parallelized library, so
users can specify the number of threads that will be working
simultaneously via the nthread
parameter. However, when nthread > 1
,
the training result is NOT guaranteed to be reproducible, even if
a random seed is set.
Now everything looks good, except one inadequacy: the setting of tuning
parameters is ad-hoc, which may make the model sub-optimal.
To tune these parameters, we can call the $tune()
function to test
a set of candidate values and use cross validation
to evaluate their performance. Below shows this process:
opts_tune = r$tune(train_set, # 9
opts = list(dim = c(10, 20, 30), # 10
costp_l2 = c(0.01, 0.1), # 11
costq_l2 = c(0.01, 0.1), # 12
costp_l1 = 0, # 13
costq_l1 = 0, # 14
lrate = c(0.01, 0.1), # 15
nthread = 4, # 16
niter = 10, # 17
verbose = TRUE)) # 18
r$train(train_set, opts = c(opts_tune$min, # 19
niter = 100, nthread = 4)) # 20
The options in line 9 to line 15 are tuning parameters. The tuning function
will evaluate each combination of them and calculate the associated
cross-validated RMSE. The parameter set with the smallest RMSE will be contained
in the returned value, which can then be passed to $train()
(Line 19-20).
Finally, we can use the model object to do predictions. The code below shows how to predict ratings given by the first 20 users on the first 20 movies.
user = 1:20
movie = 1:20
pred = expand.grid(user = user, movie = movie)
test_set = data_memory(pred$user, pred$movie, index1 = TRUE)
pred$rating = r$predict(test_set, out_memory())
library(ggplot2)
ggplot(pred, aes(x = movie, y = user, fill = rating)) +
geom_raster() +
scale_fill_gradient("Rating", low = "#d6e685", high = "#1e6823") +
xlab("Movie ID") + ylab("User ID") +
coord_fixed() +
theme_bw(base_size = 22)
Performance
To make the best use of recosystem
, the parallel computing option nthread
should be used in the training and tuning step. Also, LIBMF and recosystem can
make use of some advanced CPU features to speed-up computation, if you
compile the package from source and turn on some compiler options.
To build recosystem
, one needs a C++ compiler that supports
the C++11 standard. Then you can edit src/Makevars
(src/Makevars.win
for Windows system)
according to the following guideline:
- The default
Makevars
provides generic options that should apply to most CPUs. - If your CPU supports SSE3 (a list of supported CPUs), add
PKG_CPPFLAGS += -DUSESSE
PKG_CXXFLAGS += -msse3
- If not only SSE3 is supported but also AVX (a list of supported CPUs), add
PKG_CPPFLAGS += -DUSEAVX
PKG_CXXFLAGS += -mavx
After editing the Makevars
file, run R CMD INSTALL recosystem
to install recosystem
.
The plot below shows the effect of parallel computing and the compiler option on the performance of computation. The y axis is the elapsed time of the model tuning procedure in the previous example.
References
[1] Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015a. A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems. ACM TIST.
[2] Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015b. A Learning-Rate Schedule for Stochastic Gradient Methods to Matrix Factorization. ACM TIST.
[3] Lin, Chih-Jen, Yu-Chin Juan, Yong Zhuang, and Wei-Sheng Chin. 2015. LIBMF: A Matrix-Factorization Library for Recommender Systems.
[4] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.