Matrix factorization techniques are used for discovering a customer's preference for items about which he has not yet expressed any opinion. Such discovery of preference helps e-commerce sites (or news-portal sites) recommend to customers items (or news) that they might consider purchasing (or reading). An e-commerce site hosts millions of items, and a customer's preference is known for only a very few of them. Thus, if a user-item preference (or rating) matrix is constructed, most of the ratings in this matrix would be 0 (or NA, i.e. not available). Such a matrix is known as a sparse matrix. As most of the cells in this matrix are zero, many different storage techniques are adopted to save space. For example, if per row (i.e. per user) preference is expressed for, say, 100 items out of a million columns, then instead of storing one million integer values per row, one could just record the row number together with the (column number, value) pairs of those 100 preferences, as below. The rest are implied to be zero.
[row_no_1 {(col_no, col_value), (col_no,col_value)....}] [row_no_2 {(col_no, col_value), (col_no,col_value)....}] [row_no_n {(col_no, col_value), (col_no,col_value)....}]
In this way, per row, for say 100 preferences, one may have to store not more than about 420 integers/characters (including brackets etc.) instead of a million (mostly zero) integers. The matrix now not only occupies less space, but possibly the whole of it can be brought into RAM for further processing. Many ingenious ways have been devised to store sparse matrices; Wikipedia is a good starting point for understanding them. Many of these storage formats also have accompanying APIs to carry out common matrix operations directly on them, such as multiplication and transpose. Thus, a number of matrix operations can be speeded up using sparse matrices.
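The idea can be sketched in a few lines of Python (the ratings below are invented for illustration); production systems instead use formats such as CSR/CSC from libraries like SciPy:

```python
# A sketch of the row-wise sparse storage described above, using a plain
# Python dict; the dense matrix below is made up for illustration.

dense = [
    [5, 0, 0, 2, 0, 0],
    [0, 0, 4, 0, 0, 0],
    [0, 2, 0, 0, 0, 0],
    [7, 0, 0, 4, 0, 0],
]

def to_sparse(matrix):
    """Keep only (column, value) pairs per row; zeros are implied."""
    return {r: {c: v for c, v in enumerate(row) if v != 0}
            for r, row in enumerate(matrix) if any(row)}

def lookup(sparse, r, c):
    """Entries that were never stored are implicitly zero."""
    return sparse.get(r, {}).get(c, 0)

sparse = to_sparse(dense)
stored = sum(len(cols) for cols in sparse.values())  # 6 values instead of 24
```

Only the six non-zero ratings are stored; every other lookup falls through to the implied zero.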
A sparse user-item rating matrix, V, can in turn be conceived of as the result of multiplying two matrices: a user-feature preference matrix and an item-feature matrix. In a user-feature matrix for mobile phones, for example, there may be such features as colour, weight etc. The same features appear in the item-feature matrix. The item-feature matrix may indicate, for each item, either the presence or absence of a feature, or its relative 'level' with respect to the value of the same feature in another similar item. Below we have three tables: one, a user-feature matrix (4 X 4); second, an item-feature matrix (4 X 3); and the third matrix is the product of the two. NoEntry, Dabbang and Gunday are names of three Indian movies.
Tables: 1 to 3. NoEntry, Dabbang and Gunday are Hindi movie names.

Item-feature matrix:

         NoEntry  Dabbang  Gunday
comedy   1        0        0
horror   0        1        1
adult    0        0        0
general  1        1        1

User-feature matrix:

       comedy  horror  adult  general
user1  4       3       1      3
user2  5       3       3      2
user3  3       2       1      1
user4  2       1       2      2

Product (user-item) matrix:

       NoEntry  Dabbang  Gunday
user1  7        6        6
user2  7        5        5
user3  4        3        3
user4  4        3        3
You may recall that the matrix multiplication of two matrices A and B (A.B) is done as follows: the value in the first cell is calculated as the dot product of the first row of A with the first column of B; the value in the second cell to its right is the dot product of the first row of A with the second column of B, and so on. Thus, the first cell value is 4 X 1 + 3 X 0 + 1 X 0 + 3 X 1 = 7 and the second cell value is 4 X 0 + 3 X 1 + 1 X 0 + 3 X 1 = 6. Each cell value is the dot product of the corresponding user row with the item column. In general, a product matrix is calculated through a set of equations, as:
a11*b11 + a12 * b21 + a13 * b31 + a14 * b41 =7
a11*b12 + a12 * b22 + a13 * b32 + a14 * b42 =6


where a11, a12, a13 and a14 are the column-wise coefficients of the first row of matrix A, and b11, b21, b31 and b41 are the row-wise coefficients of the first column of matrix B. Please refer to Wikipedia for further details.
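This arithmetic is easy to verify with a few lines of NumPy, using the user-feature and item-feature tables above:

```python
import numpy as np

# User-feature matrix A (rows: users; columns: comedy, horror, adult, general)
A = np.array([[4, 3, 1, 3],
              [5, 3, 3, 2],
              [3, 2, 1, 1],
              [2, 1, 2, 2]])

# Item-feature matrix B (rows: comedy, horror, adult, general; columns: movies)
B = np.array([[1, 0, 0],
              [0, 1, 1],
              [0, 0, 0],
              [1, 1, 1]])

# Each cell of the product is the dot product of a user row with an item column.
V = A @ B
```

`V[0, 0]` comes out as 7 and `V[0, 1]` as 6, matching the hand calculation above.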
In reverse, given just the product (generally called matrix V), it might be possible to solve the above set of equations (twelve equations, one for each cell, in our case) for the values of a11, a12, b11, b12 and so on, i.e. to find the two multiplicand matrices, A and B.
Given a user-item rating matrix, we presume that there exists a (latent, not disclosed to us) user-feature matrix and an item-feature matrix. We do not know what those latent features are, or how many of them there are; all that we know is that features do exist. But why are we interested in factorizing a sparse user-item rating matrix into a user-feature matrix and a feature-item matrix? Consider the following (not so) sparse initial user-item matrix. Our interest is to find user preferences even for those items where they are not known (represented by 0).
Table: 4 Initial user-item rating matrix

       NoEntry  Dabbang  Gunday
user1  5        0        2
user2  4        0        0
user3  0        2        0
user4  7        0        4
Manually, by trial and error, we try to factorize it into two matrices. The following is the result (ignore a bit of sparseness in the factors). The user-feature matrix (U) is as below. We do not know the names of these features, so we just call them f1 to f4.
Table: 5 Derived user-feature matrix

       f1  f2  f3  f4
user1  2   1   2   0
user2  1   1   1   1
user3  0   1   0   1
user4  2   1   2   2
And the item-feature matrix (M):
Table: 6 Derived item-feature matrix

     NoEntry  Dabbang  Gunday
f1   1        0        0
f2   1        1        0
f3   1        0        1
f4   1        1        1
If we multiply the two matrix factors in Table 5 and Table 6, we get the following user-item matrix:
Table: 7 Derived user-item matrix

       NoEntry  Dabbang  Gunday
user1  5        1        2
user2  4        2        2
user3  2        2        1
user4  7        3        4
Compare it with Table 4. We are able to fill in the blanks and learn a user's preference even in those cases where it was not known earlier. Note that in this particular case the starting point, the user-item matrix (Table 4), was not very sparse, and there was enough information to factorize it exactly: the factors, on multiplication, yielded no error for the already-known ratings (the entries of Table 7 corresponding to the non-zero cells of Table 4). In the real world we are not so lucky. The initial user-item matrix is extremely sparse, and the effort is to factorize it so that the root-mean-square error (RMSE) over the initially known ratings is minimized. There are many ways to reach this minimum, and hence we have a number of algorithms for matrix factorization.
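We can check with NumPy that the factors in Tables 5 and 6 reproduce Table 7, and that the RMSE over the ratings known from Table 4 is indeed zero:

```python
import numpy as np

U = np.array([[2, 1, 2, 0],
              [1, 1, 1, 1],
              [0, 1, 0, 1],
              [2, 1, 2, 2]])        # derived user-feature matrix (Table 5)

M = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]])           # derived item-feature matrix (Table 6)

V0 = np.array([[5, 0, 2],
               [4, 0, 0],
               [0, 2, 0],
               [7, 0, 4]])          # initial ratings (Table 4); 0 = unknown

V = U @ M                           # reconstructed full rating matrix (Table 7)

known = V0 != 0                     # score only the initially known ratings
rmse = np.sqrt(np.mean((V[known] - V0[known]) ** 2))
```

Here `rmse` is exactly 0.0, because this small example admits an exact factorization.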
But how did we know there would be four features? Well, we did not. We could have guessed five features, and the result would have been as below (again, ignore a bit of sparseness in the factors). Note that the derived user-item rating matrix is now different from the one in Table 7, even though the RMSE over known ratings is still zero.
Tables: 8 to 10: Factorization with five features

Derived item-feature matrix:

     NoEntry  Dabbang  Gunday
f1   1        0        0
f2   1        1        0
f3   1        0        1
f4   1        1        1
f5   1        0        1

Derived user-feature matrix:

       f1  f2  f3  f4  f5
user1  2   1   1   0   1
user2  1   1   1   1   0
user3  1   1   1   1   1
user4  2   1   2   2   0

Derived user-item matrix:

       NoEntry  Dabbang  Gunday
user1  5        1        2
user2  4        2        2
user3  5        2        3
user4  7        3        4
So how many features do we select? In practice, the available rating data is divided into two parts: training data and test data. The training data is used to build a number of factorizations with varying numbers of features. For each result, the RMSE is calculated over the user-item ratings held out in the test data, and the model for which this RMSE is minimum is selected.
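The selection procedure can be sketched as below. This is a simplified illustration, assuming scikit-learn's NMF as the factorizer and synthetic data in place of real ratings; it also crudely zeroes out the held-out cells, whereas a proper recommender solver (ALS, SGD) would mask unknown entries instead:

```python
import numpy as np
from sklearn.decomposition import NMF  # assumption: scikit-learn's NMF as the factorizer

rng = np.random.default_rng(0)

# Toy data: a rank-3 "true" rating matrix stands in for real ratings here.
V = rng.random((20, 3)) @ rng.random((3, 12))

test_mask = rng.random(V.shape) < 0.2  # hold out roughly 20% of cells as test data
train = V * ~test_mask                 # held-out cells crudely treated as zero

best_k, best_rmse = None, float("inf")
for k in (1, 2, 3, 4, 5):              # candidate numbers of features
    model = NMF(n_components=k, init="random", random_state=0, max_iter=1000)
    W = model.fit_transform(train)     # user-feature factor
    H = model.components_              # feature-item factor
    # RMSE is computed on the held-out (test) cells only.
    rmse = np.sqrt(np.mean((W @ H - V)[test_mask] ** 2))
    if rmse < best_rmse:
        best_k, best_rmse = k, rmse
```

The number of features (`best_k`) giving the lowest held-out RMSE wins.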
The initial sparse user-item matrix in Table 4 can also be factorized in the following way (again, ignore a bit of sparseness in the factors):
Tables: 11 to 13: Another set of factors, but with negative elements

Item-feature matrix:

     NoEntry  Dabbang  Gunday
f1   1        0        0
f2   1        1        0
f3   1        0        1
f4   1        1        1
f5   2        0        1

User-feature matrix (note the negative entries):

       f1  f2  f3  f4  f5
user1  2   -1  -1  1   2
user2  1   1   2   2   -1
user3  1   -2  0   4   1
user4  2   1   2   2   0

Derived user-item matrix:

       NoEntry  Dabbang  Gunday
user1  5        0        2
user2  4        3        3
user3  5        2        5
user4  7        3        4
But this time we have negative preferences in the user-feature matrix. The sparser the matrix, the more possible factorizations there are. Generally we go for non-negative matrix factorization (NMF) rather than factors with negative elements in them, because factors with negative elements are difficult to interpret. Also, numerical algorithms rather than equation solving are used for matrix factorization; hence the factorization is generally approximate rather than exact, with non-negativity imposing one more constraint on it.
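As a sketch of what such a numerical, constrained factorization looks like in practice (assuming scikit-learn's NMF implementation; the choice of two components is arbitrary):

```python
import numpy as np
from sklearn.decomposition import NMF  # assumption: scikit-learn as the NMF implementation

V = np.array([[5., 0., 2.],
              [4., 0., 0.],
              [0., 2., 0.],
              [7., 0., 4.]])           # the initial rating matrix of Table 4

model = NMF(n_components=2, init="nndsvda", max_iter=2000)
W = model.fit_transform(V)             # user-feature ('basis') factor
H = model.components_                  # feature-item ('coefficient') factor

# Both factors are constrained to be non-negative, which keeps them
# interpretable as feature weights; the reconstruction is only approximate.
approx = W @ H
```

Unlike the hand-built exact factors above, `approx` will only be close to V, not equal to it.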
But why do we interpret the factors of the user-item matrix as a user-feature matrix and an item-feature matrix at all? Maybe what we call the user-feature matrix is in fact a user-user (some-relationship) matrix, and so also for the other factor! We will explain this intuitively. Let us assume that they really are feature matrices, and show that this assumption helps us understand the calculation of a user's rating by matrix multiplication. If V is an m X n user-item matrix, W is an m X k user-feature matrix and M is an n X k item-feature matrix, then:
V = W . transpose(M)
or, V =W.H
where, H = transpose(M)
k is the number of features. Generally k is much smaller than either m, the number of users, or n, the number of items. For example, a 943 X 1063 user-item matrix may have two non-negative factors of sizes 943 X 15 and 1063 X 15. The original matrix has 943 * 1063 = 1002409 elements, while the factors have, in all, 943*15 + 1063*15 = 30090 elements; this is about 33 times fewer than the original matrix. Factors, therefore, compress information.
Further, every user's preference for each feature is now represented row-wise in matrix W, while how much of each feature is contained in a particular item is represented column-wise in H. To determine a user's rating for an item, we multiply that user's row with that item's column; in other words, the user's preference for each feature is weighted by how much of that feature the item contains, and the results are summed over all features. It is this intuitive interpretation of NMF that makes it useful to treat the two factors of V, W and H, as the user-(hidden)feature matrix and the (hidden)feature-item matrix rather than anything else. W is appropriately called the 'basis' matrix, while H is called the 'coefficient' matrix. (As an aside, note that how much of a feature is contained in an item should be positive, not negative; hence non-negative matrix factorization.) The matrix:
Vi = W.hi
is a one-column list of the respective ratings of all users for one item 'i', represented by the (coefficient) column hi of H.
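For instance, using the factors from Tables 5 and 6, one column of H picks out all users' ratings for a single item:

```python
import numpy as np

W = np.array([[2, 1, 2, 0],
              [1, 1, 1, 1],
              [0, 1, 0, 1],
              [2, 1, 2, 2]])      # W: user-feature (basis) matrix, from Table 5

H = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]])         # H: feature-item (coefficient) matrix, from Table 6

h_i = H[:, 0]                     # coefficient column for item i = 'NoEntry'
V_i = W @ h_i                     # Vi = W.hi: every user's rating for that one item
```

`V_i` equals the NoEntry column of the full product W.H, namely the ratings 5, 4, 2 and 7 for users 1 to 4.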
In the next blog I will explain how to factorize a user-item rating matrix using the Alternating Least Squares method. We will use Mahout for the purpose.
Tags: matrix factorization, recommender engine, Tutorial on matrix factorization and recommender engine