Alternating Least Squares matrix factorization for a recommender engine using Mahout on Hadoop - Part I

For a conceptual explanation of how matrix factorization applies to recommender problems, please see my earlier blog. Mahout offers the Alternating Least Squares (ALS) algorithm for matrix factorization. The user-item rating matrix, V, is factorized into an m X k user-feature matrix, U, and an n X k item-feature matrix, M (for m users, n items and k features). Mahout's ALS recommender engine can be used for large data sets spread over many machines in an HDFS environment. It thus enjoys an advantage over techniques that require all data to fit into the RAM of a single machine.
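
To make this a little more concrete, here is a sketch of the optimization ALS performs (standard formulation; as far as I can tell from the Mahout documentation, parallelALS implements the ALS-WR variant of Zhou et al., in which the regularization of each user and item is weighted by its number of ratings):

\min_{U,M} \sum_{(u,i)\ \mathrm{observed}} \left( v_{ui} - \mathbf{u}_u^{\top}\mathbf{m}_i \right)^2 \;+\; \lambda \left( \sum_u n_u \lVert \mathbf{u}_u \rVert^2 + \sum_i n_i \lVert \mathbf{m}_i \rVert^2 \right)

Here \mathbf{u}_u and \mathbf{m}_i are the k-dimensional feature rows of U and M, n_u and n_i are the number of ratings by user u and of item i, and \lambda is the regularization parameter passed to parallelALS below. The algorithm alternates: solve for U with M held fixed, then for M with U held fixed, and repeat for the requested number of iterations.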

Mahout's ALS recommender has the further advantage of making recommendations when user-item preferences are not explicit but implicit. In an explicit feedback environment, a user explicitly states his rating for an item. In an implicit feedback environment, a user's preference for an item is inferred when, for example, he shows interest in an item by clicking relevant links, when he purchases an item, or when he searches for an item. With explicit feedback, the rating scale can have any range, say 1 to 5. With implicit feedback, the rating is either 0 or 1; the maximum rating is 1.
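
As an aside, Mahout's implicit-feedback mode appears to follow the Hu-Koren-Volinsky formulation: a raw observation r_ui (number of clicks, purchases, searches, etc.) is converted into a binary preference p_ui (1 if r_ui > 0, else 0) together with a confidence weight

c_{ui} = 1 + \alpha \, r_{ui}

and each squared error in the ALS objective is weighted by c_{ui}. This is where the --alpha parameter used later in this post comes from; with explicit feedback it is simply ignored.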

For our experiments, we will use the MovieLens data set ml-100k. The data set consists of 100,000 ratings from 943 users for 1682 movies, on a scale of 1 to 5. Each user has rated at least 20 movies. We will work with the 'u1.base' file in this data set. The file, u1.base, has data in the following tab-separated format:

1	1	5	874965758
1	2	3	876893171
1	3	4	878542960
2	1	4	888550871
2	10	2	888551853
2	14	4	888551853
2	19	3	888550871
3	317	2	889237482
3	318	4	889237482
3	319	2	889237026
4	210	3	892003374
4	258	5	892001374
4	271	4	892001690

The first column is the userid, the second column is the movieid, the third is the rating and the last is the time stamp. We are not concerned with the time stamp here and hence we will remove this column. The following awk script will do this for us:

awk '{print $1"\t"$2"\t"$3}' u1.base > uexp.base

We will upload the user-movie rating file, 'uexp.base', to Hadoop and then use Mahout to first build the user-item rating matrix and then factorize it. 'mahout parallelALS' does this matrix factorization using the ALS technique. To see all the arguments 'mahout parallelALS' accepts, issue this command on the command line:

mahout parallelALS -h

In our case, we have installed the Cloudera Hadoop eco-system on our machine. Mahout gets installed automatically with Cloudera. Of course, some initial configuration does need to be done. But we assume that Mahout is configured and working on your system.
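
If you are not sure of this, a quick sanity check is to run the Mahout driver script with no arguments:

mahout

It should print the list of valid program names, with parallelALS and recommendfactorized among them.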

The following code snippet builds the user-item matrix and factorizes it. The code is heavily commented to make it easy to understand.

#!/bin/bash
# We will use u1.base file which contains ratings data

#######Some constants##########
# Folder in local file system where user-rating file, u1.base, is placed
datadir="/home/ashokharnal/Documents/datasets/Movielens"
localfile=$datadir/"uexp.base"

# Folder in hadoop-filesystem to store user-rating file
ddir="/user/ashokharnal"
hdfs_movie_file=$ddir/"uexp.base"
# Folder where calculated factor matrices will be placed
out_folder=$ddir/"uexp.out"
# Temporary working folder in hdfs
temp="/tmp/itemRatings"
cd $datadir

# Remove time-stamps from user-rating file 
awk '{print $1"\t"$2"\t"$3}' u1.base > uexp.base

# Copy rating file from local filesystem to hadoop
hdfs dfs -put $localfile $ddir/
# Check if file-copying successful
hdfs dfs -ls $ddir/uexp.base

# Delete temporary working folder in hadoop, if it already exists
hdfs dfs -rm -r -f $temp

# Build model from command line.
mahout parallelALS --input $hdfs_movie_file \
                   --output $out_folder \
                   --lambda 0.1 \
                   --implicitFeedback false \
                   --alpha 0.8 \
                   --numFeatures 15 \
                   --numIterations 100 \
                   --tempDir $temp

##Parameters in mahout are
# lambda            Regularization parameter to avoid overfitting
# implicitFeedback  User's preference is implied by his purchases (true) else false
# alpha             How confident are we of implicit feedback (used only when implicitFeedback is true)
#                   Of course, in the above case it is 'false' and hence alpha is not used.
# numFeatures       No of user and item features
# numIterations     No of iterations 
# tempDir           Location of temporary working files 

The above script writes a user-feature matrix under the Hadoop folder /user/ashokharnal/uexp.out/U/ and an item-feature matrix under /user/ashokharnal/uexp.out/M/. The matrix files under folders U and M are in (binary) sequence file format. There is another folder, /user/ashokharnal/uexp.out/userRatings/; the sequence files in this folder contain the already known user ratings for various movieids.
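
If you want to peek inside these factor matrices, the same seqdumper command used later in this post works here too. Point it at the U or M folder (or at an individual part-file inside it, depending on your Mahout version); the output paths below are just illustrative:

mahout seqdumper -i /user/ashokharnal/uexp.out/U/ -o /home/ashokharnal/U_dump.txt
mahout seqdumper -i /user/ashokharnal/uexp.out/M/ -o /home/ashokharnal/M_dump.txt

Each record of the dump is a userid (or movieid) as the key and its 15 latent feature values (numFeatures above) as the vector.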

We can now use the user-feature matrix and the item-feature matrix to calculate top-N recommendations for any user. For this purpose, userids need to be stored in sequence file format where the userid is the key and the movieid (which will be ignored) is the value. We write the following text file:

1,6
2,13
3,245
4,50
46,682

In this file, userids 1, 2, 3, 4 and 46 are of interest, while the movieids written against them are arbitrary numbers. We are interested in finding the top-N recommendations for these userids. We convert this file to a sequence file using the following Java code; there is presently no command-line tool available from Mahout to produce such a file. The code is heavily commented for ease of understanding. You can use the NetBeans IDE to write, build and run this code. See this link.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.math.Vector;

public class VectorsFileForRecommender
	{
	public static void main(String[] args) throws IOException
		{
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		String input;
                String output;
                String line;
                input = "/home/ashokharnal/keyvalue.txt";
                output = "/home/ashokharnal/part-0000";

                // Create a file reader object
		BufferedReader reader = new BufferedReader(new FileReader(input));
                // Create a SequenceFile writer object
		SequenceFile.Writer writer = new SequenceFile.Writer( fs,conf,new Path(output), IntWritable.class, VectorWritable.class);
		
                // Read lines of input files, one record at a time
		while ((line = reader.readLine()) != null)
                    {
                    String[] rec;                           // A string array. 
                    rec = line.split(",");                  // Split line at comma delimiter and fill the string array 
                    double[] d = new double[rec.length];    // A double array of dimension rec.length
                    d[0] = Double.parseDouble(rec[0]);      // Double conversion needed for creating vector 
                    d[1] = Double.parseDouble(rec[1]);

                    // We will print, per record, lots of outputs to bring clarity 
                    System.out.println("------------------");
                    System.out.println("rec array length: "+rec.length);
                    System.out.println("userid: "+rec[0]);
                    System.out.println("Movieid: "+Double.parseDouble(rec[1]));
                  
                    // Create a Random access sparse vector. A random access sparse vector
                    //  only stores non-zero values and any value can be accessed randomly
                    //    as against sequential access.
                    // Class, RandomAccessSparseVector, implements vector to 
                    //   store non-zero values of type doubles
                    // We may either create a RandomAccessSparseVector object of size just 1, as:
                    //   Vector vec = new RandomAccessSparseVector(1);
                    // Or, create a RandomAccessSparseVector of size 2, as:
                    Vector vec = new RandomAccessSparseVector(rec.length);
                    
                    // Method, assign(), applies the function to each element of the receiver.
                    //   If the RandomAccessSparseVector size is just 1, we may assign to it
                    //      either the userid or the movieid. For example: vec.assign(d[1]);
                    // The argument to assign() can only be a double array or a double value
                    //   or a vector, but not an integer or text.
                    vec.assign(d);      // Assign a double array to vector
                 
                    // Prepare for writing the vector in sequence file
                    //    Create an object of class VectorWritable and set its value
                    VectorWritable writable = new VectorWritable();
                    writable.set(vec);
                                        
                    // Check vector size
                    System.out.println("Vector size: "+ vec.size());
                    // Check vector value
                    System.out.println("Vector value: "+ vec.toString());
                    // Check what is actually being written to sequence file
                    System.out.println("Vector value being written: "+writable.toString());
                    System.out.println("Key value being written: "+d[0]);
                    
                    // Mahout sequence file for 'recommendfactorized' requires that key be of class IntWritable
                    //   and value which is ignored be of class VectorWritable.
                    // Append now line-input to sequence file in either way:
                    //   writer.append( new IntWritable(Integer.valueOf(rec[0])) , writable);
                    //    OR
                    writer.append( new IntWritable( (int) d[0]) , writable);
                    // Note: As value part of sequencefile is ignored, we could have written just
                    //       any arbitrary number to it instead of rec-array.
                    }
                writer.close();
                reader.close();
                }
        }

This code produces a sequence file, /home/ashokharnal/part-0000, on the local file system. Copy it to Hadoop as, say, /user/ashokharnal/part-0000, as shown below.
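
A plain hdfs put does the copying (assuming the destination folder already exists):

hdfs dfs -put /home/ashokharnal/part-0000 /user/ashokharnal/part-0000

You can then dump the content of this sequence file to a text file in your local file system using the following mahout command: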

mahout seqdumper -i /user/ashokharnal/part-0000   -o /home/ashokharnal/dump.txt

File, dump.txt, has the following contents:

[ashokharnal@master ~]$ cat dump.txt
Input Path: /user/ashokharnal/part-0000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 1: Value: {1:6.0,0:1.0}
Key: 2: Value: {1:13.0,0:2.0}
Key: 3: Value: {1:245.0,0:3.0}
Key: 4: Value: {1:50.0,0:4.0}
Key: 46: Value: {1:682.0,0:46.0}
Count: 5

As mentioned above, the key is of class 'IntWritable' while the value part in curly brackets is of class 'VectorWritable'. That the key be of class 'IntWritable' and the value of class 'VectorWritable' is a requirement for our next step. We now use the 'mahout recommendfactorized' command to make top-N recommendations for the specified users. The script is as follows:

# This script is in continuation of earlier bash script
#  Dollar constants used are as in the earlier bash script
mahout recommendfactorized \
       --input $ddir/part-0000  \
       --userFeatures $out_folder/U/ \
       --itemFeatures $out_folder/M/ \
       --numRecommendations 15 \
       --output /tmp/topNrecommendations \
       --maxRating 5
       
##Parameters
# input                   Sequence file with userids
# userFeatures            user-feature matrix
# itemFeatures            item-features matrix
# numRecommendations      top-N recommendations to make per user listed in the input file
# output                  top-N recommendations in descending order of importance
# maxRating               Maximum possible rating for any item

You can now open the text file under the Hadoop folder '/tmp/topNrecommendations'. The top-N recommendations appear as follows:

1   [1449:5.0,119:4.8454704,169:4.837305,408:4.768415,474:4.710996,50:4.6945467,1142:4.692111,694:4.646718,127:4.614179,174:4.6061845,513:4.6046524,178:4.6008058,483:4.5722823,1122:4.5680165,12:4.5675783]
2	[1449:5.0,1512:4.9012074,1194:4.900322,1193:4.751229,242:4.7209125,178:4.717094,318:4.702426,661:4.700798,427:4.696252,302:4.6929398,357:4.6668787,1064:4.642539,603:4.6239715,98:4.5959983,694:4.5866184]
3	[902:5.0,320:4.716057,865:4.540695,1143:4.36101,340:4.310355,896:4.307236,179:4.195304,1368:4.194901,512:4.1928997,345:4.169366,1642:4.136083,445:4.094295,321:4.0908194,133:4.085527,423:4.06355]
4	[251:5.0,253:5.0,1368:5.0,190:5.0,1137:5.0,320:5.0,223:5.0,100:5.0,1449:5.0,1005:5.0,1396:5.0,10:5.0,1466:5.0,1099:5.0,1642:5.0]
46	[958:5.0,512:5.0,1449:5.0,1159:5.0,753:5.0,170:5.0,1642:5.0,408:5.0,1062:5.0,114:5.0,745:5.0,921:5.0,793:5.0,515:5.0,169:4.9880037]
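
To pull these recommendations out of HDFS on the command line, something like the following works (the exact part-file name depends on the job, so check with hdfs dfs -ls first):

hdfs dfs -ls /tmp/topNrecommendations
hdfs dfs -cat /tmp/topNrecommendations/part-m-00000

Each output line is a userid followed by a list of movieid:predicted-rating pairs, in descending order of predicted rating.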

In fact, if you are interested, you can build top-N recommendations for all users in one go, using the user-ratings (sequence) files under the folder 'userRatings' (these contain the already known user ratings, as in file uexp.base). Recall that this folder was created by the earlier mahout script. The following is the mahout script:

mahout recommendfactorized \
       --input /user/ashokharnal/uexp.out/userRatings/  \
       --userFeatures $out_folder/U/ \
       --itemFeatures $out_folder/M/ \
       --numRecommendations 15 \
       --output /tmp/topNrecommendations \
       --maxRating 5

Incidentally, Mahout does not necessarily need Hadoop. It will work on the local file system even if you do not have Hadoop installed. To make Mahout work on the local file system and ignore Hadoop altogether, set the variable MAHOUT_LOCAL to any non-empty value, as:

export MAHOUT_LOCAL="abc"

# To make it work on hadoop, unset the value
export MAHOUT_LOCAL=""

It is better to set the value in ~/.bashrc and then run: source ~/.bashrc
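
For example, to append the setting to ~/.bashrc and have it take effect in the current shell (the value "true" is just an arbitrary non-empty string):

echo 'export MAHOUT_LOCAL="true"' >> ~/.bashrc
source ~/.bashrc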

While we now know the top-N recommendations for a user, what if we want the rating for a movie not in the top-N? In the next blog we will cover this aspect, i.e. finding the predicted user-rating for any (userid, movieid) pair.

That finishes this. Have a cup of coffee!
