## Matrix Factorization for Recommender Engines: A Simple Explanation

Matrix-factorization techniques are used to discover a customer's preference for items on which he has not yet expressed any preference. Such discovery helps e-commerce sites (or news portals) recommend to customers items (or news) that they may consider purchasing (or reading). An e-commerce site hosts millions of items, and a customer's preference is known for only a very few of them. Thus, if a user-item preference (or rating) matrix is constructed, most of the ratings in this matrix would be 0 (or NA, i.e. not available). Such a matrix is known as a sparse matrix. As most of the cells in this matrix are zero, many different storage techniques are adopted to save space. For example, if per row (i.e. per user) preference is expressed for, say, 100 items out of a million item columns, then instead of storing a million integer values per row, one could just record the row number along with the column number and value of each of these 100 preferences, as below. The rest are implied to be zero.

```
[row_no_1  {(col_no, col_value), (col_no, col_value), ...}]
[row_no_2  {(col_no, col_value), (col_no, col_value), ...}]
[row_no_n  {(col_no, col_value), (col_no, col_value), ...}]
```

In this way, per row, for, say, 100 preferences, one may have to store not more than about 420 integers/characters (including brackets etc.) instead of a million (mostly zero) integers. The matrix now not only occupies less space but also, possibly, the whole of it can be brought into RAM for further processing. Many ingenious ways have been devised to store sparse matrices; Wikipedia is a good starting point for understanding them. Many of these storage formats also have accompanying APIs to carry out common matrix operations directly on them, such as matrix multiplication and transposition. Thus, a number of matrix operations can be sped up using sparse matrices.
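The row-wise scheme above can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not any library's actual sparse format; the names `SparseRows`, `set`, `get` and `to_dense` are invented for this sketch.

```python
# Minimal sketch of row-wise sparse storage: per row we keep only the
# (column, value) pairs that are non-zero; absent cells are implied zeros.

class SparseRows:
    def __init__(self, n_rows, n_cols):
        self.n_rows, self.n_cols = n_rows, n_cols
        self.rows = {}  # row_no -> {col_no: value}

    def set(self, r, c, value):
        self.rows.setdefault(r, {})[c] = value

    def get(self, r, c):
        # any cell never stored is an implied zero
        return self.rows.get(r, {}).get(c, 0)

    def to_dense(self):
        # expand back to a full list-of-lists matrix (expensive for wide rows)
        return [[self.get(r, c) for c in range(self.n_cols)]
                for r in range(self.n_rows)]

m = SparseRows(2, 1_000_000)   # 2 users, a million item columns
m.set(0, 5, 4)                  # user 0 rated item 5 with a 4
m.set(1, 42, 3)
print(m.get(0, 5))   # -> 4
print(m.get(0, 99))  # -> 0 (never stored, implied zero)
```

Real implementations (COO, CSR, CSC and friends) refine this idea with flat arrays for speed, but the space saving comes from the same observation: only non-zero entries are stored.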

A sparse user-item rating matrix, V, can in turn be conceived of as the result of multiplying two matrices: a user-feature preference matrix and an item-feature matrix. In a user-feature matrix for mobile phones, for example, there may be such features as colour, weight, etc. The same features may appear in an item-feature matrix. The item-feature matrix may indicate, for an item, either the presence or absence of a feature or its relative 'level' with respect to the value of the same feature in another similar item. Below, we have three tables: first, a user-feature matrix (4 X 4); second, an item-feature matrix (4 X 3); and third, the user-item matrix (4 X 3) that is the product of the two. No-Entry, Dabbang and Gunday are names of three Indian movies.

Tables: 1 to 3. No-Entry, Dabbang and Gunday are Hindi movie names

User-feature matrix (Table 1):

|       | comedy | horror | adult | general |
|-------|--------|--------|-------|---------|
| user1 | 4      | 3      | 1     | 3       |
| user2 | 5      | 3      | 3     | 2       |
| user3 | 3      | 2      | 1     | 1       |
| user4 | 2      | 1      | 2     | 2       |

Item-feature matrix (Table 2):

|         | No-Entry | Dabbang | Gunday |
|---------|----------|---------|--------|
| comedy  | 1        | 0       | 0      |
| horror  | 0        | 1       | 1      |
| adult   | 0        | 0       | 0      |
| general | 1        | 1       | 1      |

Their product, the user-item rating matrix (Table 3):

|       | No-Entry | Dabbang | Gunday |
|-------|----------|---------|--------|
| user1 | 7        | 6       | 6      |
| user2 | 7        | 5       | 5      |
| user3 | 4        | 3       | 3      |
| user4 | 4        | 3       | 3      |

You may recall that the matrix-multiplication of two matrices A and B (A.B) is done as follows:

The value in the first cell is calculated as the dot product of the first row of A with the first column of B; the value in the second cell to its right is the dot product of the first row of A with the second column of B, and so on. Thus, the first cell value is: 4 X 1 + 3 X 0 + 1 X 0 + 3 X 1 = 7, and the second cell value is: 4 X 0 + 3 X 1 + 1 X 0 + 3 X 1 = 6. Thus, each cell value is the dot product of the corresponding user row with the item column. In general, a product matrix is calculated through a set of equations, as:

a11*b11 + a12*b21 + a13*b31 + a14*b41 = 7
a11*b12 + a12*b22 + a13*b32 + a14*b42 = 6
...
where a11, a12, a13 and a14 are the coefficients of the first row of matrix A, and b11, b21, b31 and b41 are the coefficients of the first column of matrix B. Please refer to Wikipedia for further details.
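The dot-product rule above is easy to verify in code. The sketch below multiplies the matrices of Tables 1 and 2 with plain Python (no libraries) and reproduces the hand-calculated values 7 and 6:

```python
# Plain-Python matrix multiplication checking the worked example above.
# A is the user-feature matrix (Table 1), B the item-feature matrix (Table 2).

A = [[4, 3, 1, 3],   # user1: comedy, horror, adult, general
     [5, 3, 3, 2],   # user2
     [3, 2, 1, 1],   # user3
     [2, 1, 2, 2]]   # user4

B = [[1, 0, 0],      # comedy:  No-Entry, Dabbang, Gunday
     [0, 1, 1],      # horror
     [0, 0, 0],      # adult
     [1, 1, 1]]      # general

def matmul(A, B):
    # each cell is the dot product of a row of A with a column of B
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

V = matmul(A, B)
print(V[0][0], V[0][1])  # -> 7 6, matching the hand calculation
```

The full product `V` is exactly the user-item matrix of Table 3.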

In reverse, given just the product (generally called matrix V), it might be possible to solve the above set of equations (twelve equations, one for each cell, in our case) for the values of a11, a12, b11, b21 and so on, i.e. to find the two multiplicand matrices, A and B.

Given a user-item rating matrix, we presume that there exists a (latent, not disclosed to us) user-feature matrix and an item-feature matrix. We do not know what those latent features are, or how many of them there are; all that we know is that features do exist. But why are we interested in factorizing a sparse user-item rating matrix into user-feature and item-feature matrices? Consider the following (not so) sparse initial user-item matrix. Our interest is to find a user's preference even for those items where it is not known (represented by 0).

Table: 4 Initial user-item rating matrix

|       | No-Entry | Dabbang | Gunday |
|-------|----------|---------|--------|
| user1 | 5        | 0       | 2      |
| user2 | 4        | 0       | 0      |
| user3 | 0        | 2       | 0      |
| user4 | 7        | 0       | 4      |

Manually, by trial and error, we try to factorize it into two matrices. The following is the result (ignore the slight sparseness of the factors). The user-feature matrix (U) is shown below. We do not know the names of these features, so we simply name them f1-f4.

Table: 5 Derived user-feature matrix

|       | f1 | f2 | f3 | f4 |
|-------|----|----|----|----|
| user1 | 2  | 1  | 2  | 0  |
| user2 | 1  | 1  | 1  | 1  |
| user3 | 0  | 1  | 0  | 1  |
| user4 | 2  | 1  | 2  | 2  |

And item-feature matrix (M) of:

Table: 6 Derived item-feature matrix

|    | No-Entry | Dabbang | Gunday |
|----|----------|---------|--------|
| f1 | 1        | 0       | 0      |
| f2 | 1        | 1       | 0      |
| f3 | 1        | 0       | 1      |
| f4 | 1        | 1       | 1      |

If we multiply the two matrix-factors in Table 5 and Table 6, we get the following user-item matrix:

Table: 7 Derived user-item matrix

|       | No-Entry | Dabbang | Gunday |
|-------|----------|---------|--------|
| user1 | **5**    | 1       | **2**  |
| user2 | **4**    | 2       | 2      |
| user3 | 2        | **2**   | 1      |
| user4 | **7**    | 3       | **4**  |

Compare it with Table 4. We are able to fill in the blanks and learn a user's preference even in those cases where it was not previously known. Note that in this particular case the starting user-item matrix (Table 4) was not very sparse, and there was enough information to factorize it: the factors, on multiplication, reproduced the already-known ratings with no error (bold numbers in Table 7). In the real world we are not so lucky. The initial user-item matrix is very, very sparse, and the effort is to factorize it so that the root-mean-square error (RMSE) over the initially known ratings is minimized. There are many ways to approach this minimum, and hence we have a number of algorithms for matrix factorization.
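The whole check can be done in a few lines: multiply the factors of Tables 5 and 6 and measure the RMSE against the known (non-zero) entries of Table 4. All numbers below are taken directly from those tables.

```python
# Multiply the derived factors (Tables 5 and 6) and check that the product
# reproduces every known (non-zero) rating of Table 4 exactly, i.e. RMSE = 0.

U = [[2, 1, 2, 0],   # user-feature matrix, Table 5
     [1, 1, 1, 1],
     [0, 1, 0, 1],
     [2, 1, 2, 2]]

M = [[1, 0, 0],      # item-feature matrix, Table 6 (rows f1..f4)
     [1, 1, 0],
     [1, 0, 1],
     [1, 1, 1]]

V0 = [[5, 0, 2],     # initial sparse ratings, Table 4 (0 = unknown)
      [4, 0, 0],
      [0, 2, 0],
      [7, 0, 4]]

V = [[sum(u * m for u, m in zip(row, col)) for col in zip(*M)] for row in U]

# RMSE over the known ratings only
known = [(i, j) for i in range(4) for j in range(3) if V0[i][j] != 0]
rmse = (sum((V[i][j] - V0[i][j]) ** 2 for i, j in known) / len(known)) ** 0.5
print(V)     # -> the user-item matrix of Table 7
print(rmse)  # -> 0.0
```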

But how did we know there would be four features? Well, we did not know. We could have guessed five features, and the result would have been as below (again, ignore the slight sparseness of the factors). Note that the derived user-item rating matrix is now different from the one in Table 7 above, even though the RMSE over the known ratings is still zero.

Tables: 8-10: Factorization with five features

Item-feature matrix (Table 8):

|    | No-Entry | Dabbang | Gunday |
|----|----------|---------|--------|
| f1 | 1        | 0       | 0      |
| f2 | 1        | 1       | 0      |
| f3 | 1        | 0       | 1      |
| f4 | 1        | 1       | 1      |
| f5 | 1        | 0       | 1      |

User-feature matrix (Table 9):

|       | f1 | f2 | f3 | f4 | f5 |
|-------|----|----|----|----|----|
| user1 | 2  | 1  | 1  | 0  | 1  |
| user2 | 1  | 1  | 1  | 1  | 0  |
| user3 | 1  | 1  | 1  | 1  | 1  |
| user4 | 2  | 1  | 2  | 2  | 0  |

Derived user-item matrix (Table 10):

|       | No-Entry | Dabbang | Gunday |
|-------|----------|---------|--------|
| user1 | 5        | 1       | 2      |
| user2 | 4        | 2       | 2      |
| user3 | 5        | 2       | 3      |
| user4 | 7        | 3       | 4      |

So how many features do we select? In practice, the known ratings are divided into two parts: training data and test data. The training data is used to build a number of factorizations with varying numbers of features. For each result, the RMSE is calculated over the user-item ratings held out in the test data, and the model with the minimum test RMSE is selected.
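The selection procedure can be sketched as follows. This is a hedged illustration, not any particular library's API: the `factorize` and `rmse` helpers are invented for this sketch, the factorization here uses simple stochastic gradient descent (one of the many algorithms mentioned above), and the tiny train/test split is made up for demonstration.

```python
# Sketch: try several feature counts k, keep the one with lowest test RMSE.
import random

train = {(0, 0): 5.0, (0, 2): 2.0, (1, 0): 4.0, (3, 0): 7.0}  # (user, item): rating
test  = {(2, 1): 2.0, (3, 2): 4.0}                            # held-out ratings
n_users, n_items = 4, 3

def factorize(ratings, k, steps=2000, lr=0.01):
    # fit W (users x k) and H (items x k) by stochastic gradient descent
    rng = random.Random(0)
    W = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    H = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for (u, i), r in ratings.items():
            err = r - sum(W[u][f] * H[i][f] for f in range(k))
            for f in range(k):  # update both factors using the old values
                W[u][f], H[i][f] = (W[u][f] + lr * err * H[i][f],
                                    H[i][f] + lr * err * W[u][f])
    return W, H

def rmse(ratings, W, H, k):
    se = sum((r - sum(W[u][f] * H[i][f] for f in range(k))) ** 2
             for (u, i), r in ratings.items())
    return (se / len(ratings)) ** 0.5

# pick the k whose factorization generalizes best to the held-out ratings
best_k = min(range(1, 4), key=lambda k: rmse(test, *factorize(train, k), k))
print(best_k)
```

Note that users or items absent from the training data (like user3 here) keep their random factors, which is why very sparse data makes model selection hard in practice.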

The initial sparse user-item matrix in Table 4 can also be factorized in the following way (again, ignore the slight sparseness of the factors):

Tables: 11-13: Another set of factors but with negative elements

Item-feature matrix (Table 11):

|    | No-Entry | Dabbang | Gunday |
|----|----------|---------|--------|
| f1 | 1        | 0       | 0      |
| f2 | 1        | 1       | 0      |
| f3 | 1        | 0       | 1      |
| f4 | 1        | 1       | 1      |
| f5 | 2        | 0       | 1      |

User-feature matrix (Table 12):

|       | f1 | f2 | f3 | f4 | f5 |
|-------|----|----|----|----|----|
| user1 | 2  | -1 | 1  | -1 | 2  |
| user2 | 1  | 1  | 2  | 2  | -1 |
| user3 | 1  | -2 | 0  | 4  | 1  |
| user4 | 2  | 1  | 2  | 2  | 0  |

Derived user-item matrix (Table 13):

|       | No-Entry | Dabbang | Gunday |
|-------|----------|---------|--------|
| user1 | 5        | -2      | 2      |
| user2 | 4        | 3       | 3      |
| user3 | 5        | 2       | 5      |
| user4 | 7        | 3       | 4      |

But this time we have negative preferences in the user-feature matrix. The sparser the matrix, the more possible factorizations there are. Generally we go for non-negative matrix factorization (NMF) rather than factors with negative elements in them, because factors with negative elements are difficult to interpret. Also, numerical algorithms, rather than equation solving, are used for matrix factorization. Hence the factorization is generally approximate rather than exact, with non-negativity imposing one more constraint on the factorization.

But why at all do we interpret the factors of the user-item matrix as a user-feature matrix and an item-feature matrix? Maybe what we call the user-feature matrix is in fact a user-user (some-relationship) matrix, and so also for the other factor! We will explain this intuitively. Let us assume that they really are so, and show that this assumption does help us understand the calculation of a user's rating by matrix multiplication. If V is the m X n user-item matrix, W is the m X k user-feature matrix and M is the n X k item-feature matrix, then:

V = W . transpose(M)
or,        V =W.H
where, H = transpose(M)

k is the number of features. Generally k is much smaller than either m, the number of users, or n, the number of items. For example, for a 943 X 1063 user-item matrix, we may have two non-negative factors of size 943 X 15 and 1063 X 15. The original matrix has 943 * 1063 = 1002409 elements, while the factors together have 943*15 + 1063*15 = 30090 elements; this is about 33 times fewer than the original matrix. Factors, therefore, compress information.
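The element counts for this example can be checked directly:

```python
# Element counts for the 943 x 1063 example with k = 15 features.
m, n, k = 943, 1063, 15
dense   = m * n          # elements in the original user-item matrix
factors = m * k + n * k  # elements in the two factors combined
print(dense, factors, round(dense / factors))  # -> 1002409 30090 33
```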

Further, every user's preference for each feature is now represented row-wise in matrix W. On the other hand, how much of each feature is contained in a particular item is represented column-wise in H. To determine a user's rating for an item, we multiply that user's row with that item's column; in other words, the user's preference for each feature is weighted by how much of that feature is contained in the item, and the result is summed over all features. It is this intuitive interpretation that makes it useful to treat the two factors of V, W and H, as a user-(hidden) feature matrix and an item-(hidden) feature matrix rather than as something else. W is appropriately called the 'basis' matrix, while H is called the 'coefficient' matrix. (As an aside, note that how much of a feature is contained in an item should be positive, not negative; hence non-negative matrix factorization.) The product:

Vi = W.hi

is a column of the respective ratings of all users for one item i, represented by the (coefficient) column hi of H.
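As a concrete check, we can compute Vi = W.hi with W from Table 5 and hi taken as the Gunday column of the item-feature matrix in Table 6:

```python
# Ratings of all users for one item, computed as V_i = W . h_i.
W = [[2, 1, 2, 0],   # user-feature (basis) matrix, Table 5
     [1, 1, 1, 1],
     [0, 1, 0, 1],
     [2, 1, 2, 2]]
h_i = [0, 0, 1, 1]   # how much of each feature f1..f4 Gunday contains (Table 6)

v_i = [sum(w * h for w, h in zip(row, h_i)) for row in W]
print(v_i)  # -> [2, 2, 1, 4], the Gunday column of the derived Table 7
```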

In the next blog I will explain how to factorize a user-item rating matrix using the Alternating Least Squares (ALS) method. We will use Mahout for the purpose.