Alternating Least Squares Matrix Factorization for a Recommender Engine Using Mahout on Hadoop: Part II

In our last blog we used Mahout’s ALS algorithm to discover top-N recommendations for a user. In this blog we find the predicted rating for any (userid, movieid) combination, not necessarily one in the top-N. We continue from where the last blog concluded. We are given a text file (‘uexp.test’) as below:

ID,user,movie
110,1,6
111,1,10
112,1,12
113,1,14
114,1,17
115,2,13
116,2,50

Our task is to create a text file with just ID and rating, as below:

ID,rating
110,4
111,4
112,5
113,1
114,2
115,3
116,4

Specifically, from our work in the last blog, we will use the results of the ‘mahout parallelALS’ command, repeated below. You may like to peruse the earlier blog before proceeding with this one.

# Build model from command line.
mahout parallelALS --input $hdfs_movie_file \
                   --output $out_folder \
                   --lambda 0.1 \
                   --implicitFeedback false \
                   --alpha 0.8 \
                   --numFeatures 15 \
                   --numIterations 100 \
                   --tempDir $temp

## Parameters used above
# lambda            Regularization parameter to avoid over-fitting
# implicitFeedback  true if the user's preference is implied by his purchases, else false
# alpha             How confident we are of implicit feedback (used only when implicitFeedback is true).
#                   In the above case implicitFeedback is 'false' and hence alpha is not used.
# numFeatures       Number of user and item features
# numIterations     Number of iterations
# tempDir           Location of temporary working files

The outputs of this command are as below. All files in folders are sequence files:
1. user-feature matrix files in folder, $out_folder/U/, in hadoop
2. item-feature matrix files in folder, $out_folder/M/, in hadoop
3. user’s already known ratings for movies in folder, $out_folder/userRatings/

Steps that we follow are these:
1. Dump contents of U to see how many rows and columns this matrix contains
2. Dump contents of M to see how many rows and columns this matrix contains
3. Transpose user-feature matrix
4. Transpose item-feature matrix
5. Multiply the two transposed matrices using mahout’s command line tool to get the user-item matrix. Mahout’s multiplication tool multiplies two matrices A and B as transpose(A).B and not as A.B. Hence step 3 above: passing transpose(U) as A yields transpose(transpose(U)) = U.
6. Dump the results of the multiplication to text files on the local file system
7. Use a bash script to extract ratings for the relevant (userid, movieid) pairs
8. Write each rating to a new file with ID and rating, as mentioned earlier.

We use the ‘mahout seqdumper’ command to dump the contents of the two files under M to M0.txt and M1.txt.

# Dump M to see how many items are there
mahout seqdumper -i /user/ashokharnal/uexp1.out/M/part-m-00000 \
                 -o /home/ashokharnal/M0.txt
mahout seqdumper -i /user/ashokharnal/uexp1.out/M/part-m-00001 \
                 -o /home/ashokharnal/M1.txt

Open the two files, M0.txt and M1.txt, and read the Count at the end; this is the number of keys, i.e. the number of rows. The number of values in a row gives the number of columns. For example, sample (truncated) contents of ‘M0.txt’ are as follows:

Input Path: /user/ashokharnal/uexp1.out/M/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 1: Value: {0:1.974276012,1:0.0831050238,2:0.273167938,3:0.1869148199,4:-0.1584698196,5:-0.1129896600,6:-0.237190063,7:-0.155191340,8:0.388374577,9:0.2465730276,10:0.00144229714,11:0.515006337,12:0.1919460605,13:0.1356261162,14:0.3860959700}
Key: 3: Value: {0:1.4980332750521874,1:0.13121572246681273,2:0.4705798204432542,3:-0.290794568454,4:0.76276377,5:-0.264316266,6:0.2338155050,7:0.328942047,8:-0.105526130,9:-0.063718423,10:0.219792438,11:-0.56082225,12:0.245130819,13:-0.32558446,14:0.67558142}
 |
 |
Count: 826
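The same row and column counts can be read off without opening the file. Below is a minimal sketch, assuming the dump layout shown above (one ‘Key: n: Value: {...}’ line per row, dense vectors with no spaces inside the braces). The scratch directory and the three-feature stand-in file are illustrative only; point the two commands at your real M0.txt instead.

```shell
#!/bin/bash
# Sketch only: a tiny stand-in for M0.txt in a scratch directory (hypothetical data)
workdir=$(mktemp -d)
cat > "$workdir/M0.txt" <<'EOF'
Input Path: /user/ashokharnal/uexp1.out/M/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 1: Value: {0:1.97,1:0.08,2:0.27}
Key: 3: Value: {0:1.49,1:0.13,2:0.47}
Count: 2
EOF

# Rows: lines starting with "Key: " (the "Key class:" header does not match)
rows=$(grep -c '^Key: ' "$workdir/M0.txt")

# Columns: number of index:value pairs in the first vector (4th field)
cols=$(awk '/^Key: / { print split($4, pairs, ","); exit }' "$workdir/M0.txt")

echo "rows=$rows cols=$cols"
```

The row count obtained this way should agree with the Count line at the end of the dump.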

Similarly, in file ‘M1.txt’ the Count is 824. Thus, the total number of rows is 826+824=1650, and the number of columns (values indexed from 0 to 14) is 15. The dimensions of the item-feature matrix are therefore 1650 X 15, where 1650 is the number of items. In the same manner dump the contents of U as below:

mahout seqdumper -i /user/ashokharnal/uexp1.out/U/part-m-00000  \
                 -o /home/ashokharnal/U0.txt
mahout seqdumper -i /user/ashokharnal/uexp1.out/U/part-m-00001  \
                 -o /home/ashokharnal/U1.txt

The dimensions of the user-feature matrix are (471+472) X 15, or 943 X 15, where 943 is the number of users. Our objective is to get the user-item matrix, V, as below:

V = U. transpose(M)
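Entry by entry, this product says that each predicted rating is the dot product of one user's feature vector with one item's feature vector. With the 15 features used here, it can be written as follows (the index names u, i and k are notation introduced only for this note):

```latex
V_{ui} = \sum_{k=0}^{14} U_{uk}\, M_{ik},
\qquad U \in \mathbb{R}^{943 \times 15},\quad
M \in \mathbb{R}^{1650 \times 15},\quad
V = U M^{\top} \in \mathbb{R}^{943 \times 1650}.
```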

However, mahout’s matrix multiplication tool, given A and B, computes transpose(A).B rather than A.B. As transpose(transpose(U)) = U, we pass transpose(U) as the first matrix, and so we first transpose U as follows:

mahout transpose  -nr 943 -nc 15  \
                  -i /user/ashokharnal/uexp1.out/U/  \
                  --tempDir /tmp/transposeU

The transposed matrix folder is generated alongside folder U, with a name something like ‘transpose-33’. In our case the transposed matrix location is: /user/ashokharnal/uexp1.out/transpose-33/. Similarly, transpose M with the command:

mahout transpose  -nr 1650 -nc 15   \
                  -i /user/ashokharnal/uexp1.out/M/  \
                  --tempDir /tmp/transposeM

The transposed matrix is created at: /user/ashokharnal/uexp1.out/transpose-195/.

We now have to multiply the two transposed matrices. We use the ‘mahout matrixmult’ utility as follows. (To get help on it, run the command ‘mahout matrixmult -h’.)

mahout matrixmult -nra 15 -nca 943 -nrb 15  -ncb 1650  \
                  -ia /user/ashokharnal/uexp1.out/transpose-33/  \
                  -ib  /user/ashokharnal/uexp1.out/transpose-195  \
                  --tempDir /tmp/useless
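The dimension flags can be checked against the identity the tool relies on. Writing A and B for the two inputs (symbols used only in this note), and taking the tool to return transpose(A).B as described earlier:

```latex
A = U^{\top} \in \mathbb{R}^{15 \times 943}, \qquad
B = M^{\top} \in \mathbb{R}^{15 \times 1650}, \qquad
A^{\top} B = (U^{\top})^{\top} M^{\top} = U M^{\top} = V \in \mathbb{R}^{943 \times 1650}.
```

This is why -nra and -nrb are both 15, while -nca and -ncb are 943 and 1650.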

The matrix product folder is generated under the /tmp folder with a name something like /tmp/productWith-35/. To work with this user-item matrix using bash, we dump the (sequence) product files to the local file system, as follows:

mahout seqdumper -i /tmp/productWith-35/part-00000  \
                 -o /home/ashokharnal/p0.txt
mahout seqdumper -i /tmp/productWith-35/part-00001  \
                 -o /home/ashokharnal/p1.txt

Truncated output for one key (user) looks as below. You can interpret it as follows: for userid 4 and movieid 5, the predicted rating is 3.5528462109, or 4 once rounded up to an integer.

Key: 4: Value: {1:3.9332423190,2:3.3451704447,3:2.3199345720,4:3.779028659,5:3.5528462109,6:4.4076660033, ..... ......... ,1680:1.4181067023,1681:2.4794630145,1682:3.115888046}

Reading the values above, you might think that this matrix has values numbered 1 to 1682, i.e. 1682 columns. This is not so, as many of the ‘itemids’ are missing. The number of columns would be 1650 (I have not counted but you are free to do so). Adding up the ‘Count’ values written at the end of the two files (‘p0.txt’ and ‘p1.txt’), you would find the number of rows to be 943. To make it easier to search for a particular userid (key) and a particular itemid (value), we concatenate the two files as below:

# First remove the first two informative lines and the last (Count) line from the two files
sed '1,2d' p0.txt | sed '$d' > p00.txt
sed '1,2d' p1.txt | sed '$d' > p01.txt
# Append the first to second
cat p00.txt >> p01.txt
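An alternative that does not depend on how many header lines the dump carries is to keep only the lines that begin with ‘Key: ’. Below is a minimal sketch, assuming the dump layout shown earlier; the scratch directory and the small stand-in dumps are hypothetical, so substitute your own p0.txt and p1.txt.

```shell
#!/bin/bash
# Sketch only: tiny stand-in dumps in a scratch directory (hypothetical data)
workdir=$(mktemp -d)
cat > "$workdir/p0.txt" <<'EOF'
Input Path: /tmp/productWith-35/part-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 1: Value: {1:3.9,2:3.3}
Key: 2: Value: {1:2.1,2:4.4}
Count: 2
EOF
cat > "$workdir/p1.txt" <<'EOF'
Input Path: /tmp/productWith-35/part-00001
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 3: Value: {1:1.2,2:4.8}
Count: 1
EOF

# Keep only the data rows from both dumps (-h suppresses file-name prefixes)
grep -h '^Key: ' "$workdir/p0.txt" "$workdir/p1.txt" > "$workdir/p01.txt"

# Sanity check: rows kept should equal the sum of the two Count lines
kept=$(grep -c '^Key: ' "$workdir/p01.txt")
expected=$(awk '/^Count: / { total += $2 } END { print total }' "$workdir/p0.txt" "$workdir/p1.txt")
echo "kept=$kept expected=$expected"
```

The order of rows in the combined file differs from the sed-and-cat approach, which does not matter because each row is looked up by its key.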

We now have the complete V matrix in file ‘p01.txt’, but in (Key,Value) format. We also have the ‘uexp.test’ file (see the sample at the top of this blog). Before using this file we remove its first line (the header). We use the following shell script to read userids and movieids from this file and then extract the corresponding rating value from ‘p01.txt’. The bash script is heavily commented for easy understanding.

#!/bin/bash
##############
# Use as:
# cat uexp.test | ./extractandwrite.sh
# Format of uexp.test: 'ID,userid,movieid' but no header
##############

## Some constants. File 'p01.txt' contains matrix-product
input_file="/home/ashokharnal/p01.txt"
output_file="/home/ashokharnal/submitoutput.txt"

# Begin reading file 'uexp.test' line by line
while IFS=, read -r ID userid movieid
do
    #1. ID, userid and movieid are the 1st, 2nd and 3rd comma-separated
    #   fields; 'read' splits each line on IFS=, directly

    #2. Feed file 'p01.txt' to awk. Each line has four space-separated fields, as:
    #    'Key: 4: Value: {1:3.9332423190590364,2:3.3451704447958877}'
    #     For a userid ('4' above), assign 4th field to user_preferences
    user_preferences=$(awk -v  myvar="$userid:"   ' { if ($2==myvar) print $4  } ' $input_file)

    #3. Look now for movieid within user_preferences and extract colon separated rating.
    #    Shell utilities 'tr' and 'awk' are used. 'tr' translates three
    #      symbols ',' '{' & '}' (i.e. comma and two curly brackets) with newline ('\n').
    #       Consequently single line of variable user_preferences gets split into multiple lines.
    #          Each split-line contains a pair like 23:2.33456 i.e. just movieid:rating.
    #            Pipe each of the pairs into awk and match 1st field with movieid.
    #              When match occurs, awk extracts the 2nd field i.e. rating.
    #                Assign extracted value to 'dec_rating'
    dec_rating=$(echo $user_preferences |  tr , '\n' | tr { '\n' | tr } '\n' | awk -v myvar="$movieid" -F: '$1 == myvar {print $2}' )

    #4. Raise dec_rating to its ceiling integer, e.g. 3.2 becomes 4. 5.001 becomes 6, which step 4a caps at 5
    ceil_rating=$(perl -w -e "use POSIX; print ceil($dec_rating/1.0), qq{\n}")

    #4a. Limit max-rating to 5 and minimum to 1
    lt_rating=$(echo $ceil_rating | awk '$0<1{$0=1}$0>5{$0=5}1')

    #5. Concatenate lt_rating to ID
    revisedline="$ID,$lt_rating"

    #5a. Append it to output file
    echo "$revisedline" >> "$output_file"
done
echo "Output in file: $output_file"

As mentioned in the script's header, run it as below. Output is appended to output_file, ‘/home/ashokharnal/submitoutput.txt’, so delete any earlier copy of that file before a re-run.

# cd to where the two files are
cat uexp.test | ./extractandwrite.sh
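The script above rescans ‘p01.txt’ once for every line of ‘uexp.test’. Where that is too slow, the same lookup can be done in a single pass. The following is a sketch of that variant, not the script used in this blog: it assumes the same two file formats, uses small stand-in files in a scratch directory (hypothetical data), and skips any (userid, movieid) pair that is absent from the matrix, which is a choice made for this sketch.

```shell
#!/bin/bash
# Sketch only: stand-ins for uexp.test (no header) and p01.txt (hypothetical data)
workdir=$(mktemp -d)
cat > "$workdir/uexp.test" <<'EOF'
110,4,5
111,4,2
112,7,3
EOF
cat > "$workdir/p01.txt" <<'EOF'
Key: 4: Value: {1:3.93,2:3.34,5:3.55}
Key: 7: Value: {1:0.40,3:5.20}
EOF

awk '
    # First file: remember each request in order, and which pairs are wanted
    NR == FNR {
        split($0, f, ",")
        n++; id[n] = f[1]; user[n] = f[2]; movie[n] = f[3]
        want[f[2] "," f[3]] = 1
        next
    }
    # Second file: one row per user; keep only the ratings that were asked for
    {
        u = $2; sub(/:$/, "", u)          # "4:" -> "4"
        v = $4; gsub(/[{}]/, "", v)       # strip the curly brackets
        m = split(v, pairs, ",")
        for (i = 1; i <= m; i++) {
            split(pairs[i], kv, ":")
            if ((u "," kv[1]) in want) rating[u "," kv[1]] = kv[2]
        }
    }
    END {
        for (i = 1; i <= n; i++) {
            key = user[i] "," movie[i]
            if (!(key in rating)) continue    # pair not in matrix: skipped in this sketch
            r = rating[key] + 0
            c = int(r); if (c < r) c++        # ceiling, as in step 4
            if (c < 1) c = 1                  # limits, as in step 4a
            if (c > 5) c = 5
            print id[i] "," c
        }
    }
' "$workdir/uexp.test" "$workdir/p01.txt" > "$workdir/submitoutput.txt"

cat "$workdir/submitoutput.txt"
```

Only the requested pairs are held in memory, so the full 943 X 1650 matrix is never stored, and the output keeps the order of ‘uexp.test’.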

This finishes our work.
