Alternating Least Squares with implicit feedback – The search for alpha


So you want to build a recommendation engine with ALS… You search the internet for some example code in your language of choice… You copy paste the code and start tweaking… But then you realize that your data is different from all the examples you found online. You don’t have explicit ratings in some range from 1 to 10; instead, you have click events where 1 means ‘clicked’. Will you still be able to use ALS? And if so, how?

A brief recap on Collaborative Filtering and Alternating Least Squares

Collaborative Filtering is the technique of predicting the preference of a user by collecting the preferences of many users. The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B’s opinion on a different issue $i$ than to have the opinion on $i$ of a person chosen randomly (Wikipedia). A preference takes the form of a (useritemrating) triple. Collecting them yields a sparse matrix $R_{u \times i}$ of known ratings of users $u$ for items $i$. The task is to fill in the missing values. In the Latent Model approach to Collaborative Filtering, we do this by decomposing $R_{u \times i}$ into a user matrix $X_{u \times g}$ and an items matrix $Y_{i \times g}$ so that we can find the ‘hidden’ features $g$ of users and items. (In the case of movies one could think of these hidden features as genres.) By taking the product of the user matrix and item matrix, we can reconstruct the (complete) ratings matrix $\hat{R} = X \cdot Y^T$. Or for individual ratings: $\hat{r}_{ui} = x_{u}^T y_i$ To compute these factors, we will first randomly initialize $X$ and $Y$ and iteratively recompute them by minimizing the loss function $L$: [latex] \sum\limits_{u,i} (r_{ui} – x_u^T y_i)^{2} + \lambda \Big( \sum\limits_u \|x_u\|^{2} + \sum\limits_i \|y_i\|^{2} \Big) [/latex] The first term in $L$ is the sum of squared errors and the second term is used for regularization. In order to minimize the loss function, we will take the derivatives with respect to $x$ and $y$ and solve for 0. [latex] \begin{aligned} \frac{\partial L}{\partial x_u} &= 0 \\ -2\sum\limits_i(r_{ui} – x_u^T y_i) y_i^T + 2 \lambda x_u^T &= 0 \\ -2 (r_u^T – x_u^T Y^T)Y + 2 \lambda x_u^T &= 0 \\ -2 r_u^T Y + 2 x_u^T Y^T Y + 2 \lambda x_u^T &= 0 \\ x_u^T Y^T Y + \lambda x_u^T &= r_u^T Y \\ x_u^T \big( Y^T Y + \lambda I \big) &= r_u^T Y \\ \big( Y^T Y + \lambda I \big) x_{u} &= Y^T r_u \\ x_u &= \big( Y^T Y + \lambda I \big)^{-1} Y^T r_u \end{aligned} [/latex] And for y: [latex] \begin{aligned} \frac{\partial L}{\partial y_i} &= 0 \\ -2\sum\limits_i(r_{ui} – y_i^T x_u) x_u^T + 2 \lambda y_i^T &= 0 \\ y_i &= \big( X^T X + \lambda I \big)^{-1} X^T r_i \end{aligned} %y_i &= \big( X X^T + \lambda I \big)^{-1} X r_i^T [/latex] Recomputing $x_{u}$ and $y_i$ can be done with Stochastic Gradient Descent, but this is a non-convex optimization problem. We can convert it into a set of quadratic problems, by keeping either $x_u$ or $y_i$ fixed while optimizing the other. In that case, we can iteratively solve $x$ and $y$ by alternating between them until the algorithm converges. This is Alternating Least Squares.

Implicit Feedback

Unfortunately, you don’t have ratings, you have click events. And a ‘click’ event does not necessarily mean a user really likes the product; the user could be curious about the product for some other reason. Even when you are using ‘buy’ events you are not in the clear, because people buy gifts for other people all the time. Furthermore, the absence of a click event does not imply a dislike for a product $i$. So can you still use ALS? Yes, you can still use ALS, but you have to take into account the fact that you have implicit ratings/feedback. Luckily, your preferred machine learning library shows there is an ‘implicit’ switch on the ALS interface and that there is an ‘alpha’ parameter involved as well.

So what is this $\alpha$ character?

To understand alpha, we should go to the source, which is Hu et al. 2008 (1). He suggests to split each rating into a preference and a confidence level. The preference is calculated by capping the rating to a binary value. [latex] p_{ui} = \begin{cases} 1, & r_{ui} > 0 \\ 0, & r_{ui} = 0 \end{cases} [/latex] The confidence level is defined as: [latex]c_{ui} = 1 + \alpha r_{ui}[/latex] For a rating of 0 we would have a minimum confidence of 1 and if the rating increases, the confidence increases accordingly. The rate of increase is controlled by alpha. So alpha reflects how much we value observed events versus unobserved events. Factors are now computed by minimizing the following loss function L: [latex] \sum\limits_{u,i} c_{ui} (p_{ui} – x_{u}^{T}y_i)^{2} + \lambda \Big( \sum\limits_u \|x_u\|^{2} + \sum\limits_i \|y_i\|^{2} \Big) [/latex] Now suppose that for a given rating $r_{ui}$ that $x_{u}^T y_i$ is very large, so that the squared residual $(p_{ui} – x_{u}^{T}y_i)^{2}$ is very large, then the rating $r_{ui}$ has a big impact on our loss function. And it should! $x_{u}^T y_i$ will be drawn towards the 0-1 range, which is a good thing, because we want to predict whether the event will occur or not (0 or 1). Alternatively, suppose that for a given rating $r_{ui}$ we have observed many events, and suppose also that our initial $x_{u}^T y_i$ value is close to 0, so that the squared residual $(p_{ui} – x_{u}^{T}y_i)^{2}$ approximates 1, then the rating $r_{ui}$ will still have quite some impact on our loss function, because our confidence $c_{ui}$ is large. Again, this is a good thing, because in this case, we want $x_{u}^T y_i$ to go towards 1. If either the confidence level is low or the residual is low, there is not much impact on the loss function, so the update of $x_u$ and $y_i$ will be small.


Now that we have some background on this alpha, we can safely copy-paste the recommender engine code we found online and expand it so that it includes the alpha parameter. All that is left now is some extra parameter tuning and we are done! After that, we can run our final model on the test set and we can calculate the root mean squared error… Wait.. What..? Somehow, that metric just doesn’t feel right. Oh, well… Enough for today 🙂


(1) Y. Hu, Y. Koren and C. Volinsky, “Collaborative Filtering for Implicit Feedback Datasets”, 2008