Alternating Least Squares matrix factorization for a recommender engine using Mahout on Hadoop – Part I

December 29, 2014

For a conceptual explanation of how matrix factorization applies to solving recommender problems, please see my earlier blog. Mahout offers the Alternating Least Squares (ALS) algorithm for matrix factorization. An m X n user-item matrix, V, can be factorized into an m X k user-feature matrix, U, and an n X k item-feature matrix, M. Mahout's ALS recommender engine can be used for large data sets spread over many machines in an HDFS environment. It thus enjoys an advantage over techniques that require all data to fit into the RAM of a single machine.

Mahout's ALS recommender has the further advantage of making recommendations when user-item preferences are not explicit but implicit. In an explicit-feedback environment, a user explicitly states his rating for an item. In an implicit-feedback environment, a user's preference for an item is inferred when, for example, he shows interest in an item by clicking relevant links, purchases it, or searches for it. With explicit feedback, the rating scale can have any range, say 1 to 5. With implicit feedback, the rating is either 0 or 1; the maximum rating is 1.

For our experiments, we will use the MovieLens data set ml-100k. The data set consists of 100,000 ratings from 943 users for 1682 movies on a scale of 1-5. Each user has rated at least 20 movies. We will work with the 'u1.base' file in this data set, which has data in the following tab-separated format:

1	1	5	874965758
1	2	3	876893171
1	3	4	878542960
2	1	4	888550871
2	10	2	888551853
2	14	4	888551853
2	19	3	888550871
3	317	2	889237482
3	318	4	889237482
3	319	2	889237026
4	210	3	892003374
4	258	5	892001374
4	271	4	892001690

The first column is the userid, the second column is the movieid, the third is the rating and the last is the timestamp. We are not concerned with the timestamp here and hence will remove this column. The following awk script does this for us:

awk '{print $1"\t"$2"\t"$3}' u1.base > uexp.base

We will upload the user-movie rating file 'uexp.base' to Hadoop and then use Mahout to first build the user-item rating matrix and then factorize it. 'mahout parallelALS' performs this matrix factorization using the ALS technique. To see the arguments that 'mahout parallelALS' accepts, issue the following command on a command line:

mahout parallelALS -h

In our case, we have installed the Cloudera Hadoop ecosystem on our machine. Mahout gets installed automatically with Cloudera. Of course, some initial configuration does need to be done, but we assume that Mahout is configured and working on your system.

The following code snippet builds the user-item matrix and factorizes it. The code is heavily commented to make it easy to understand.

#!/bin/bash
# We will use u1.base file which contains ratings data

#######Some constants##########
# Folder in local file system where user-rating file, u1.base, is placed
datadir="/home/ashokharnal/Documents/datasets/Movielens"
localfile=$datadir/"uexp.base"

# Folder in hadoop-filesystem to store user-rating file
ddir="/user/ashokharnal"
hdfs_movie_file=$ddir/"uexp.base"
# Folder where calculated factor matrices will be placed
out_folder=$ddir/"uexp.out"
# Temporary working folder in hdfs
temp="/tmp/itemRatings"
cd $datadir

# Remove time-stamps from user-rating file 
awk '{print $1"\t"$2"\t"$3}' u1.base > uexp.base

# Copy rating file from local filesystem to hadoop
hdfs dfs -put $localfile $ddir/
# Check if file-copying successful
hdfs dfs -ls $ddir/uexp.base

# Delete temporary working folder in hadoop, if it already exists
hdfs dfs -rm -r -f $temp

# Build model from command line.
mahout parallelALS --input $hdfs_movie_file \
                   --output $out_folder \
                   --lambda 0.1 \
                   --implicitFeedback false \
                   --alpha 0.8 \
                   --numFeatures 15 \
                   --numIterations 100 \
                   --tempDir $temp

##Parameters in mahout are
# lambda            Regularization parameter to avoid overfitting
# implicitFeedback  User's preference is implied by his purchases (true) else false
# alpha             How confident are we of implicit feedback (used only when implicitFeedback is true)
#                   Of course, in the above case it is 'false' and hence alpha is not used.
# numFeatures       No of user and item features
# numIterations     No of iterations 
# tempDir           Location of temporary working files 

The above script writes a user-feature matrix under the folder (on Hadoop) /user/ashokharnal/uexp.out/U/ and an item-feature matrix under the folder /user/ashokharnal/uexp.out/M/. The matrix files under the folders U and M are in (binary) sequence-file format. There is another folder, /user/ashokharnal/uexp.out/userRatings/; the sequence files in this folder contain the already known user ratings for various movieids.

We can now use the user-feature matrix and the item-feature matrix to calculate top-N recommendations for any user. For this purpose, userids need to be stored in sequence-file format where the userid is the key and a movieid (which will be ignored) is the value. We write the following text file:

1,6
2,13
3,245
4,50
46,682

In this file, userids 1, 2, 3, 4 and 46 are of interest, while the movieids written against them are arbitrary numbers. We are interested in finding the top-N recommendations for these userids. We convert this file to a sequence file using the following Java code; there is presently no command-line tool available from Mahout to produce such a file. The code is highly commented for ease of understanding. You can use the NetBeans IDE to write, build and run this code. See this link.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.math.Vector;

public class VectorsFileForRecommender
	{
	public static void main(String[] args) throws IOException
		{
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		String input;
                String output;
                String line;
                input = "/home/ashokharnal/keyvalue.txt";
                output = "/home/ashokharnal/part-0000";

                // Create a file reader object
		BufferedReader reader = new BufferedReader(new FileReader(input));
                // Create a SequenceFile writer object
		SequenceFile.Writer writer = new SequenceFile.Writer( fs,conf,new Path(output), IntWritable.class, VectorWritable.class);
		
                // Read lines of input files, one record at a time
		while ((line = reader.readLine()) != null)
                    {
                    String[] rec;                           // A string array. 
                    rec = line.split(",");                  // Split line at comma delimiter and fill the string array 
                    double[] d = new double[rec.length];    // A double array of dimension rec.length
                    d[0] = Double.parseDouble(rec[0]);      // Double conversion needed for creating vector 
                    d[1] = Double.parseDouble(rec[1]);

                    // We will print, per record, lots of outputs to bring clarity 
                    System.out.println("------------------");
                    System.out.println("rec array length: "+rec.length);
                    System.out.println("userid: "+rec[0]);
                    System.out.println("Movieid: "+Double.parseDouble(rec[1]));
                  
                    // Create a Random access sparse vector. A random access sparse vector
                    //  only stores non-zero values and any value can be accessed randomly
                    //    as against sequential access.
                    // Class, RandomAccessSparseVector, implements vector to 
                    //   store non-zero values of type doubles
                    // We may either create a RandomAccessSparseVector object of size just 1, as:
                    //   Vector vec = new RandomAccessSparseVector(1);
                    // Or, create a RandomAccessSparseVector of size 2, as:
                    Vector vec = new RandomAccessSparseVector(rec.length);
                    
                    // Method assign() applies the function to each element of the receiver.
                    //   If the RandomAccessSparseVector size is just 1, we may assign to it
                    //      either userid or movieid. For example: vec.assign(d[1]);
                    // The argument to assign() can only be a double array, a double value
                    //   or a vector, but not an integer or text.
                    vec.assign(d);      // Assign a double array to vector
                 
                    // Prepare for writing the vector in sequence file
                    //    Create an object of class VectorWritable and set its value
                    VectorWritable writable = new VectorWritable();
                    writable.set(vec);
                                        
                    // Check vector size
                    System.out.println("Vector size: "+ vec.size());
                    // Check vector value
                    System.out.println("Vector value: "+ vec.toString());
                    // Check what is actually being written to sequence file
                    System.out.println("Vector value being written: "+writable.toString());
                    System.out.println("Key value being written: "+d[0]);
                    
                    // Mahout sequence file for 'recommendfactorized' requires that key be of class IntWritable
                    //   and value which is ignored be of class VectorWritable.
                    // Append now line-input to sequence file in either way:
                    //   writer.append( new IntWritable(Integer.valueOf(rec[0])) , writable);
                    //    OR
                    writer.append( new IntWritable( (int) d[0]) , writable);
                    // Note: As value part of sequencefile is ignored, we could have written just
                    //       any arbitrary number to it instead of rec-array.
                    }
                writer.close();
                }
        }

This code produces a sequence file, /home/ashokharnal/part-0000, on the local file system. Copy it to Hadoop at, say, /user/ashokharnal/part-0000. You can dump the contents of this sequence file to a text file in your local file system using the following mahout command:

mahout seqdumper -i /user/ashokharnal/part-0000   -o /home/ashokharnal/dump.txt

File, dump.txt, has the following contents:

[ashokharnal@master ~]$ cat dump.txt
Input Path: /user/ashokharnal/part-0000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 1: Value: {1:6.0,0:1.0}
Key: 2: Value: {1:13.0,0:2.0}
Key: 3: Value: {1:245.0,0:3.0}
Key: 4: Value: {1:50.0,0:4.0}
Key: 46: Value: {1:682.0,0:46.0}
Count: 5

As mentioned above, the key is of class 'IntWritable' while the value part in curly brackets is of class 'VectorWritable'. That the key be 'IntWritable' and the value 'VectorWritable' is a requirement for our next step. We now use the 'mahout recommendfactorized' command to make top-N recommendations for the specified users. The script is as follows:

# This script is in continuation of earlier bash script
#  Dollar constants used are as in the earlier bash script
mahout recommendfactorized \
       --input $ddir/part-0000  \
       --userFeatures $out_folder/U/ \
       --itemFeatures $out_folder/M/ \
       --numRecommendations 15 \
       --output /tmp/topNrecommendations \
       --maxRating 5
       
##Parameters
# input                   Sequence file with userids
# userFeatures            user-feature matrix
# itemFeatures            item-features matrix
# numRecommendations      top-N recommendations to make per user listed in the input file
# output                  top-N recommendations in descending order of importance
# maxRating               Maximum possible rating for any item

You can now open the text file under the Hadoop folder '/tmp/topNrecommendations'. The top-N recommendations appear as follows:

1   [1449:5.0,119:4.8454704,169:4.837305,408:4.768415,474:4.710996,50:4.6945467,1142:4.692111,694:4.646718,127:4.614179,174:4.6061845,513:4.6046524,178:4.6008058,483:4.5722823,1122:4.5680165,12:4.5675783]
2	[1449:5.0,1512:4.9012074,1194:4.900322,1193:4.751229,242:4.7209125,178:4.717094,318:4.702426,661:4.700798,427:4.696252,302:4.6929398,357:4.6668787,1064:4.642539,603:4.6239715,98:4.5959983,694:4.5866184]
3	[902:5.0,320:4.716057,865:4.540695,1143:4.36101,340:4.310355,896:4.307236,179:4.195304,1368:4.194901,512:4.1928997,345:4.169366,1642:4.136083,445:4.094295,321:4.0908194,133:4.085527,423:4.06355]
4	[251:5.0,253:5.0,1368:5.0,190:5.0,1137:5.0,320:5.0,223:5.0,100:5.0,1449:5.0,1005:5.0,1396:5.0,10:5.0,1466:5.0,1099:5.0,1642:5.0]
46	[958:5.0,512:5.0,1449:5.0,1159:5.0,753:5.0,170:5.0,1642:5.0,408:5.0,1062:5.0,114:5.0,745:5.0,921:5.0,793:5.0,515:5.0,169:4.9880037]

In fact, if you are interested, you can build top-N recommendations for all users in one go using the user-ratings (sequence) files (containing the already known user ratings, as in file uexp.base) under the folder 'userRatings'. Recall that this folder was created by the earlier mahout script. The script is as follows:

mahout recommendfactorized \
       --input /user/ashokharnal/uexp.out/userRatings/  \
       --userFeatures $out_folder/U/ \
       --itemFeatures $out_folder/M/ \
       --numRecommendations 15 \
       --output /tmp/topNrecommendations \
       --maxRating 5

While we now know the top-N recommendations for a user, what if we want the rating for a movie not in the top-N? In the next blog we will cover this aspect, i.e., finding the predicted user rating for any (userid, movieid) pair.

That finishes this. Enjoy!

Matrix factorization for a recommender engine – a simple explanation

December 28, 2014

Techniques of matrix factorization are used for discovering a customer's preference for items on which he has expressed no preference so far. Such discovery helps e-commerce sites (or news portals) recommend to customers items (or news) that they might consider purchasing (reading). An e-commerce site hosts millions of items, and a customer's preference is known for only a very few of them. Thus, if a user-item preference (or rating) matrix is constructed, most of the ratings in it would be 0 (or NA, i.e., not available). Such a matrix is known as a sparse matrix. As most of the cells in this matrix are zero, many different storage techniques are adopted to save space. For example, if per row (i.e., per user) preference is expressed for, say, 100 items out of a million columns, then instead of storing a million integer values per row, one could just record the row number and the (column number, value) pairs of these 100 preferences, as below. The rest are implied to be zero.

[row_no_1  {(col_no, col_value), (col_no,col_value)....}]
[row_no_2 {(col_no, col_value), (col_no,col_value)....}]
[row_no_n  {(col_no, col_value), (col_no,col_value)....}]

In this way, for, say, 100 preferences per row, one may have to store not more than about 420 symbols, including brackets, instead of a million (mostly zero) numbers. The matrix not only occupies less space, but possibly the whole of it can be brought into RAM for further processing. Many ingenious ways have been devised to store sparse matrices; Wikipedia can be a good starting point for understanding them. Many of these storage formats also have accompanying APIs to carry out common matrix operations, such as multiplication and transpose, directly on them. Thus, a number of matrix operations can be sped up using sparse matrices.
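To make the storage idea concrete, here is a minimal Python sketch (illustrative only, not the format Mahout or any particular library uses) of row-wise sparse storage with nested dictionaries: per user we keep only the (item, rating) pairs, and every absent item is implicitly zero.

```python
# Row-wise sparse storage: one inner dict of {item_index: rating} per user.
sparse = {
    1: {1: 5, 2: 3, 3: 4},      # user 1 rated items 1, 2 and 3
    2: {1: 4, 10: 2, 14: 4},    # user 2 rated items 1, 10 and 14
}

def get_rating(matrix, user, item):
    """Return the stored rating, or 0 if no preference was recorded."""
    return matrix.get(user, {}).get(item, 0)

print(get_rating(sparse, 1, 2))   # stored rating -> 3
print(get_rating(sparse, 2, 99))  # unrated item  -> 0
```

Only the handful of recorded preferences are stored, however many item columns the full matrix notionally has.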

A user-item sparse rating matrix, V, can in turn be conceived of as the result of multiplying two matrices: a user-feature matrix and an item-feature matrix. In a user-feature matrix for mobile phones, for example, there may be features such as color, weight, etc. The same features appear in the item-feature matrix. The item-feature matrix may indicate, for an item, either the presence or absence of a feature or its relative 'level' with respect to the value of the same feature in another similar item. Below we have three tables: an item-feature matrix (4 X 3), a user-feature matrix (4 X 4), and, third, the user-item matrix that is the product of the two. No-Entry, Dabbang and Gunday are the names of three Indian movies.

Tables: 1 to 3

Item-feature matrix (features X movies):

          No-Entry  Dabbang  Gunday
comedy        1        0       0
horror        0        1       1
adult         0        0       0
general       1        1       1

User-feature matrix (users X features):

          comedy  horror  adult  general
user1        4       3      1       3
user2        5       3      3       2
user3        3       2      1       1
user4        2       1      2       2

Product, user-item matrix (users X movies):

          No-Entry  Dabbang  Gunday
user1         7        6       6
user2         7        5       5
user3         4        3       3
user4         4        3       3

You may recall that the matrix multiplication of two matrices A and B (A.B) is done as follows:

The value in the first cell is calculated as the dot product of the first row of A with the first column of B; the value in the next cell to the right is the dot product of the first row of A with the second column of B, and so on. Thus, the first cell value is: 4 X 1 + 3 X 0 + 1 X 0 + 3 X 1 = 7, and the second cell value is: 4 X 0 + 3 X 1 + 1 X 0 + 3 X 1 = 6. Each cell value is the dot product of the corresponding user-row with the item-column. In general, the product matrix is calculated through a set of equations, as:

a11*b11 + a12 * b21 + a13 * b31 + a14 * b41 = 7
a11*b12 + a12 * b22 + a13 * b32 + a14 * b42 = 6
|
|

where a11, a12, a13 and a14 are the coefficients of the first row of matrix A, and b11, b21, b31 and b41 are the coefficients of the first column of matrix B (b12, b22, b32 and b42 being those of its second column). Please refer to Wikipedia for further details.
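The dot-product rule can be verified with a few lines of Python (an illustrative sketch, not part of the Mahout workflow), using the numbers from Tables 1 to 3:

```python
# Each cell of the user-item matrix is the dot product of a user-feature row
# with an item-feature column. Values are taken from the tables above.
U = [  # users x features (comedy, horror, adult, general)
    [4, 3, 1, 3],  # user1
    [5, 3, 3, 2],  # user2
    [3, 2, 1, 1],  # user3
    [2, 1, 2, 2],  # user4
]
M = [  # features x items (No-Entry, Dabbang, Gunday)
    [1, 0, 0],  # comedy
    [0, 1, 1],  # horror
    [0, 0, 0],  # adult
    [1, 1, 1],  # general
]

def matmul(A, B):
    """Plain matrix multiplication: cell (i, j) = row i of A . column j of B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

V = matmul(U, M)
print(V)  # [[7, 6, 6], [7, 5, 5], [4, 3, 3], [4, 3, 3]]
```

The output reproduces the product table above, including the 7 and 6 computed by hand.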

In reverse, given just the product (generally called matrix V), it might be possible to solve the above set of equations (twelve equations, one for each cell, in our case) for the values of a11, a12, b11, b12 and so on, i.e., to find the two multiplicand matrices, A and B.

Given a user-item rating matrix, we presume that there exist a (latent, not disclosed to us) user-feature matrix and an item-feature matrix. We do not know what those latent features are or how many of them there are; all we know is that features do exist. But why are we interested in factorizing a sparse user-item rating matrix into user-feature and item-feature matrices? Consider the following (not so) sparse initial user-item matrix. Our interest is to find user preferences even for those items where they are not known (represented by 0).

Table: 4 Initial user-item rating matrix

        No-Entry  Dabbang  Gunday
user1       5        0       2
user2       4        0       0
user3       0        2       0
user4       7        0       4

Manually, by trial and error, we try to factorize it into two matrices. The following is the result (ignore a bit of sparseness in factors). User-feature matrix (U) is as below. We do not know the names of these features. Therefore, we just name them f1-f4.

Table: 5 Derived user-feature matrix

        f1  f2  f3  f4
user1    2   1   2   0
user2    1   1   1   1
user3    0   1   0   1
user4    2   1   2   2

And item-feature matrix (M) of:

Table: 6 Derived item-feature matrix

      No-Entry  Dabbang  Gunday
f1        1        0       0
f2        1        1       0
f3        1        0       1
f4        1        1       1

If we multiply the two matrix-factors in Table 5 and Table 6, we get the following user-item matrix:

Table: 7 Derived user-item matrix

        No-Entry  Dabbang  Gunday
user1       5        1       2
user2       4        2       2
user3       2        2       1
user4       7        3       4

Compare it with Table 4. We have filled in the blanks and now know the user's preference even in those cases where it was not earlier known. Note that in this particular case the starting user-item matrix (Table 4) was not very sparse, and enough information was available to factorize it. The factors, on multiplication, yielded no error for the already known ratings (bold numbers in Table 7). In the real world we are not so lucky: the initial user-item matrix is very, very sparse, and the effort is to factorize it so that the root-mean-square error (RMSE) over the initially known ratings is minimized. There are many ways to achieve this minimum, and hence we have a number of algorithms for matrix factorization.
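A quick Python check (an illustrative sketch) that the factors of Tables 5 and 6 multiply back to Table 7, and that the RMSE over the ratings already known in Table 4 is zero:

```python
U = [[2, 1, 2, 0], [1, 1, 1, 1], [0, 1, 0, 1], [2, 1, 2, 2]]  # Table 5
M = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]              # Table 6
V_known = [[5, 0, 2], [4, 0, 0], [0, 2, 0], [7, 0, 4]]        # Table 4 (0 = unknown)

# Multiply the factors: cell (i, j) = user row i . item column j
V = [[sum(U[i][k] * M[k][j] for k in range(4)) for j in range(3)]
     for i in range(4)]
print(V)  # [[5, 1, 2], [4, 2, 2], [2, 2, 1], [7, 3, 4]] -- Table 7

# RMSE over the known (non-zero) cells only
err = [(V[i][j] - V_known[i][j]) ** 2
       for i in range(4) for j in range(3) if V_known[i][j] != 0]
rmse = (sum(err) / len(err)) ** 0.5
print(rmse)  # 0.0
```

Every known rating is reproduced exactly, so the RMSE over the known cells is zero; the remaining cells are the filled-in predictions.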

But how did we know there would be four features? Well, we did not. We could have guessed five features, and the result would have been as below (again, ignore a bit of sparseness in the factors). Note that the derived user-item rating matrix is now different from the one in Table 7 above, even though the RMSE over the known ratings is still zero.

Tables: 8-10: Factorization with five features


Item-feature matrix (features X movies):

      No Entry  Dabbang  Gunday
f1        1        0       0
f2        1        1       0
f3        1        0       1
f4        1        1       1
f5        1        0       1

User-feature matrix (users X features):

        f1  f2  f3  f4  f5
user1    2   1   1   0   1
user2    1   1   1   1   0
user3    1   1   1   1   1
user4    2   1   2   2   0

Product, user-item matrix (users X movies):

        No Entry  Dabbang  Gunday
user1       5        1       2
user2       4        2       2
user3       5        2       3
user4       7        3       4

So how many features do we select? In practice, the data is divided into two parts: training data and test data. The training data is used to build a number of user-item matrices with varying numbers of features. For each result, the RMSE is calculated over the user-item ratings given in the test data. The model for which this test RMSE is minimum is selected.
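The selection procedure can be sketched as a small Python loop. Note that factorize() and rmse() below are hypothetical stand-ins for an actual factorization routine and a held-out error measure, not real library calls:

```python
def select_num_features(train, test, candidates, factorize, rmse):
    """Try each candidate feature count k; keep the one with lowest test RMSE."""
    best_k, best_rmse = None, float("inf")
    for k in candidates:
        U, M = factorize(train, k)   # fit factors with k features on train data
        e = rmse(U, M, test)         # error on held-out ratings
        if e < best_rmse:
            best_k, best_rmse = k, e
    return best_k, best_rmse

# Toy usage with made-up error numbers, purely for illustration:
errors = {5: 1.10, 10: 0.95, 15: 0.97}
k, e = select_num_features(None, None, errors,
                           factorize=lambda tr, k: (k, None),
                           rmse=lambda U, M, te: errors[U])
print(k)  # 10
```

With real data, factorize() would be an ALS (or similar) routine and rmse() would compare predicted against actual ratings in the test split.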

The initial sparse user-item matrix in Table 4 can also be factored in the following way (again ignore a bit of sparseness in factors):

Item-feature matrix (features X movies):

      No Entry  Dabbang  Gunday
f1        1        0       0
f2        1        1       0
f3        1        0       1
f4        1        1       1
f5        2        0       1

User-feature matrix (users X features):

        f1  f2  f3  f4  f5
user1    2  -1   1  -1   2
user2    1   1   2   2  -1
user3    1  -2   0   4   1
user4    2   1   2   2   0

Product, user-item matrix (users X movies):

        No Entry  Dabbang  Gunday
user1       5       -2       2
user2       4        3       3
user3       5        2       5
user4       7        3       4

But this time we have negative preferences in the user-feature matrix. The sparser the matrix, the more possible factorizations there are. Generally we go for non-negative matrix factorization (NMF) rather than factors with negative elements in them. Also, numerical algorithms rather than exact equation-solving are used in matrix factorization. Hence the factorization is generally approximate rather than exact, with non-negativity imposing one more constraint on the factorization.

But why do we interpret the factors of the user-item matrix as a user-feature matrix and an item-feature matrix at all? Maybe what we call the user-feature matrix is, in fact, a user-user (some-relationship) matrix, and likewise for the other factor! We will explain this intuitively. Let us assume that they really are feature matrices, and we will show that this assumption helps us understand the calculation of a user's rating by matrix multiplication. If V is an m X n user-item matrix, W an m X k user-feature matrix and M an n X k item-feature matrix, then:

V = W . transpose(M)
or,        V =W.H
where, H = transpose(M)

k is the number of features. Generally k is much smaller than either m, the number of users, or n, the number of items. For example, a 943 X 1063 user-item matrix may have two non-negative factors of size 943 X 15 and 1063 X 15. The original matrix has 943 * 1063 = 1002409 elements, while the factors together have 943*15 + 1063*15 = 30090 elements, about 33 times fewer than the original matrix. Factors, therefore, compress information.
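The compression arithmetic can be checked directly:

```python
# Element counts for the example above: an m x n matrix versus its two
# k-feature factors.
m, n, k = 943, 1063, 15

full = m * n              # elements in the original user-item matrix
factors = m * k + n * k   # elements in the two factor matrices combined

print(full)                   # 1002409
print(factors)                # 30090
print(round(full / factors))  # roughly 33-fold compression
```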

Further, every user's preference for each feature is represented row-wise in matrix W, while how much of each feature a particular item contains is represented column-wise in H. To determine a user's rating for an item, we multiply that user's row with that item's column: the user's preference for each feature is weighted by how much of that feature the item contains, and the results are summed over all features. It is this intuitive interpretation of NMF that makes it natural to treat the two factors W and H of V as a user-(hidden)feature matrix and an item-(hidden)feature matrix rather than something else. W is appropriately called the 'basis' matrix, while H is called the 'coefficient' matrix. The product:

Vi = W.hi

is a one-column vector of the ratings of all users for one item 'i', represented by the (coefficient) column hi of H.

In the next blog I will explain how to factorize user-item rating matrix using Alternating Least Square method. We will use Mahout for the purpose.

Using R package, recommenderlab, for predicting ratings for MovieLens data

December 18, 2014

This is a problem of predicting user ratings for movies not yet rated by them. The problem is outlined at this page of 'Kaggle in Class'. Data are also available there. There are two files, 'train_v2.csv' and 'test_v2.csv'. Sample data from the 'train_v2.csv' file is as follows; there are 750156 lines of data.

ID,user,movie,rating
610739,3704,3784,3
324753,1924,802,3
808218,4837,1387,4
133808,867,1196,4
431858,2631,3072,5

Sample data from ‘test_v2.csv’ is as follows. User ratings for movies are not given; they are to be predicted. There are 250053 lines of data.

ID,user,movie
895537,5412,2683
899740,5440,904
55688,368,3717
63728,425,1721
822012,4942,3697
781895,4668,2011
472806,2907,173

This prediction can be made using matrix factorization, item-based collaborative filtering or user-based collaborative filtering. We will use the latter methods in this blog. An explanation of item-based collaborative filtering may be seen here. We will use the R package recommenderlab, which uses the data structure ratingMatrix to provide a common interface for rating data. This data structure implements many of the methods of the matrix structure in R, such as dim(), dimnames(), colCounts(), rowCounts(), colMeans(), rowMeans(), colSums() and rowSums(). Further, the method sample() can be used to sample data row-wise. The complete R code is below. The code is highly commented for easy understanding.

#### Kaggle in Class Problem. ....
####Reference: https://inclass.kaggle.com/c/predict-movie-ratings

# Set data path as per your data file (for example: "c://abc//" )
setwd("/home/ashokharnal/Documents/")

# If not installed, first install following three packages in R
library(recommenderlab)
library(reshape2)
library(ggplot2)
# Read training file along with header
tr<-read.csv("train_v2.csv",header=TRUE)
# Just look at first few lines of this file
head(tr)
# Remove 'id' column. We do not need it
tr<-tr[,-c(1)]
# Check, if removed
tr[tr$user==1,]
# Using acast to convert above data as follows:
#       m1  m2   m3   m4
# u1    3   4    2    5
# u2    1   6    5
# u3    4   4    2    5
g<-acast(tr, user ~ movie)
# Check the class of g
class(g)

# Convert it as a matrix
R<-as.matrix(g)

# Convert R into realRatingMatrix data structure
#   realRatingMatrix is a recommenderlab sparse-matrix like data-structure
r <- as(R, "realRatingMatrix")
r

# view r in other possible ways
as(r, "list")	  # A list
as(r, "matrix")   # A matrix (NA for missing ratings)

# I can turn it into data-frame
head(as(r, "data.frame"))

# normalize the rating matrix
r_m <- normalize(r)
r_m
as(r_m, "list")

# Draw an image plot of raw-ratings & normalized ratings
#  A column represents one specific movie and ratings by users
#   are shaded.
#   Note that some items are always rated 'black' by most users
#    while some items are not rated by many users
#     On the other hand a few users always give high ratings
#      as in some cases a series of black dots cut across items
image(r, main = "Raw Ratings")       
image(r_m, main = "Normalized Ratings")

# Can also turn the matrix into a 0-1 binary matrix
r_b <- binarize(r, minRating=1)
as(r_b, "matrix")

# Create a recommender object (model)
#   Run anyone of the following four code lines.
#     Do not run all four
#       They pertain to four different algorithms.
#        UBCF: User-based collaborative filtering
#        IBCF: Item-based collaborative filtering
#      Parameter 'method' decides similarity measure
#        Cosine or Jaccard
rec=Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))
rec=Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Jaccard",nn=5, minRating=1))
rec=Recommender(r[1:nrow(r)],method="IBCF", param=list(normalize = "Z-score",method="Jaccard",minRating=1))
rec=Recommender(r[1:nrow(r)],method="POPULAR")

# Depending upon your selection, examine what you got
print(rec)
names(getModel(rec))
getModel(rec)$nn

############Create predictions#############################
# This prediction does not predict movie ratings for test.
#   But it fills up the user 'X' item matrix so that
#    for any userid and movieid, I can find predicted rating
#     dim(r) shows there are 6040 users (rows)
#      'type' parameter decides whether you want ratings or top-n items
#         get top-10 recommendations for a user, as:
#             predict(rec, r[1:nrow(r)], type="topNList", n=10)
recom <- predict(rec, r[1:nrow(r)], type="ratings")
recom

########## Examination of model & experimentation  #############
########## This section can be skipped #########################

# Convert prediction into list, user-wise
as(recom, "list")
# Study and Compare the following:
as(r, "matrix")     # Has lots of NAs. 'r' is the original matrix
as(recom, "matrix") # Is full of ratings. NAs disappear
as(recom, "matrix")[,1:10] # Show ratings for all users for items 1 to 10
as(recom, "matrix")[5,3]   # Rating for user 5 for item at index 3
as.integer(as(recom, "matrix")[5,3]) # Just get the integer value
as.integer(round(as(recom, "matrix")[6039,8])) # Just get the correct integer value
as.integer(round(as(recom, "matrix")[368,3717])) 

# Convert all your recommendations to list structure
rec_list<-as(recom,"list")
head(summary(rec_list))
# Access this list. User 2, item at index 2
rec_list[[2]][2]
# Convert to data frame all recommendations for user 1
u1<-as.data.frame(rec_list[[1]])
attributes(u1)
class(u1)
# Create a column by name of id in data frame u1 and populate it with row names
u1$id<-row.names(u1)
# Check movie ratings are in column 1 of u1
u1
# Now access movie ratings in column 1 for u1
u1[u1$id==3952,1]

########## Create submission File from model #######################
# Read test file
test<-read.csv("test_v2.csv",header=TRUE)
head(test)
# Get ratings list
rec_list<-as(recom,"list")
head(summary(rec_list))
ratings<-NULL
# For all lines in test file, one by one
for ( u in 1:length(test[,2]))
{
   # Read userid and movieid from columns 2 and 3 of test data
   userid <- test[u,2]
   movieid<-test[u,3]

   # Get as list & then convert to data frame all recommendations for user: userid
   u1<-as.data.frame(rec_list[[userid]])
   # Create a (second column) column-id in the data-frame u1 and populate it with row-names
   # Remember (or check) that the rownames of u1 are the movie-ids
   # We use row.names() function
   u1$id<-row.names(u1)
   # Now access movie ratings in column 1 of u1
   x= u1[u1$id==movieid,1]
   # print(u)
   # print(length(x))
   # If no ratings were found, assign 0. You could also
   #   assign user-average
   if (length(x)==0)
   {
     ratings[u] <- 0
   }
   else
   {
     ratings[u] <-x
   }

}
length(ratings)
tx<-cbind(test[,1],round(ratings))
# Write to a csv file: submitfile.csv in your folder
write.table(tx,file="submitfile.csv",row.names=FALSE,col.names=FALSE,sep=',')
# Submit now this csv file to kaggle
########################################

Incidentally, the R-package, recommenderlab, will also build models with large datasets.

Worked-out example – item-based collaborative filtering for a recommender engine

December 18, 2014

Item-based collaborative filtering is a model-based algorithm for recommender engines. In item-based collaborative filtering, similarities between items are calculated from the rating matrix, and based upon these similarities a user's preference for an item not rated by him is calculated. Here is a step-by-step worked-out example for four users and three items. We will consider the following sample data of the preferences of four users for three items:

ID user item rating
241 u1 m1 2
222 u1 m3 3
276 u2 m1 5
273 u2 m2 2
200 u3 m1 3
229 u3 m2 3
231 u3 m3 1
239 u4 m2 2
286 u4 m3 2

Step 1: Write the user-item ratings data in a matrix form. The above table gets rewritten as follows:

m1 m2 m3
u1 2 ? 3
u2 5 2 ?
u3 3 3 1
u4 ? 2 2

Here the rating of user u1 for item m3 is 3. There is no rating for item m2 by user u1, and none for item m3 by user u2.

Step 2: We will now create an item-to-item similarity matrix. The idea is to calculate how similar an item is to every other item. There are a number of ways of calculating this; we will use the cosine similarity measure. To calculate the similarity between items m1 and m2, for example, look at all the users who have rated both these items. In our case, both m1 and m2 have been rated by users u2 and u3. We create two item-vectors, v1 for item m1 and v2 for item m2, in the user-space of (u2,u3) and then find the cosine of the angle between these vectors. A zero angle, i.e. overlapping vectors with a cosine value of 1, means total similarity, and an angle of 90 degrees would mean a cosine of 0, or no similarity. Thus, the two item-vectors would be,

            v1 = 5 u2 + 3 u3
            v2 = 2 u2 + 3 u3

The cosine similarity between the two vectors, v1 and v2, would then be:

             cos(v1,v2) = (5*2 + 3*3)/sqrt[(25 + 9)*(4 + 9)] = 0.90

Similarly, to calculate the similarity between m1 and m3, we consider only users u1 and u3, who have rated both these items. The two item-vectors, v1 for item m1 and v3 for item m3, in this user-space would be as follows:

             v1 = 2 u1 + 3 u3
             v3 = 3 u1 + 1 u3

The cosine similarity measure between v1 and v3 is:

             cos(v1,v3) = (2*3 + 3*1)/sqrt[(4 + 9)*(9 + 1)] = 0.78

We can similarly calculate the similarity between items m2 and m3 using the ratings given to both by users u3 and u4. The two item-vectors, v2 and v3, would be:

             v2 = 3 u3 + 2 u4
             v3 = 1 u3 + 2 u4

And the cosine similarity between them is:

             cos(v2,v3) = (3*1 + 2*2)/sqrt[(9 + 4)*(1 + 4)] = 0.86

We now have the complete item-to-item similarity matrix as follows:

m1 m2 m3
m1 1 0.90 0.78
m2 0.90 1 0.86
m3 0.78 0.86 1

Step 3: For each user, we next predict his ratings for the items he has not rated. We will calculate the rating of user u1 for item m2 (the target item). To do this, we take the similarity measures between the target item and the other items the user has already rated, weighting each similarity by the rating the user gave that item. We then divide this weighted sum by the sum of the similarity measures so that the predicted rating stays within the rating scale. Thus, the predicted rating of user u1 for item m2 is calculated from the similarity measures between (m2,m1) and (m2,m3), weighted by the respective ratings for m1 and m3:

                rating = (2 * 0.90 + 3 * 0.86)/(0.90 + 0.86) = 2.49
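The Step 2 and Step 3 arithmetic can be verified with a short Python sketch (the function names here are ours, purely for illustration):

```python
import math

# User-item ratings from the table above (missing entries simply absent)
ratings = {
    "u1": {"m1": 2, "m3": 3},
    "u2": {"m1": 5, "m2": 2},
    "u3": {"m1": 3, "m2": 3, "m3": 1},
    "u4": {"m2": 2, "m3": 2},
}

def cosine_sim(i, j):
    """Cosine similarity between items i and j over co-rating users only."""
    common = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    dot = sum(ratings[u][i] * ratings[u][j] for u in common)
    ni = math.sqrt(sum(ratings[u][i] ** 2 for u in common))
    nj = math.sqrt(sum(ratings[u][j] ** 2 for u in common))
    return dot / (ni * nj)

def predict(user, target):
    """Similarity-weighted average of the user's ratings for the other items."""
    rated = ratings[user]
    num = sum(cosine_sim(target, i) * r for i, r in rated.items())
    den = sum(cosine_sim(target, i) for i in rated)
    return num / den

print(round(cosine_sim("m1", "m2"), 2))  # → 0.9
print(round(predict("u1", "m2"), 2))     # → 2.49
```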

Recommender engine using item based collaborative filtering can be constructed using R package recommenderlab. See my blog here.

References:
1. Item-based collaborative filtering
2. recommenderlab: A Framework for Developing and Testing Recommendation Algorithms

Connecting MS Access to Cloudera Impala through ODBC connection

July 27, 2014

MS Access can be connected to Cloudera Impala through ODBC Connector. Our version of Impala is ver 1.3 and that of ODBC Connector is ver 2.5.5.1005 (64 bit). ODBC Connector (msi file) can be downloaded from Cloudera Impala site from here. Download and install ODBC Connector as usual on your Windows machine.

After installation, click Start->All Programs and look for the ‘Cloudera ODBC Driver for Impala‘ folder. Within it, click the ‘64-bit ODBC Administrator‘ link to open the ‘ODBC Data Source Administrator‘ window. Click on the ‘User DSN’ tab. Next, click the Add button to open the ‘Create New Data Source’ window. Here, among other ODBC connectors, you should see ‘Cloudera ODBC Driver for Impala‘. Select it and click the Finish button. A configuration window will open. Fill in any name for the ‘Data Source Name‘ and write a description. Against Host, write the address of your Cloudera server machine. Leave the default values in the other text boxes. Click the Test button to test the connection. Click OK to close the DSN setup.

Figure 1: Configuring User DSN in Impala ODBC Connector

Start MS Access. Create a new ‘Blank Database‘ as usual. MS Access opens with one default table, Table 1. Click on ‘External Data–>ODBC Database‘. In the Get External Data wizard, select ‘Link to the data source by creating a linked table’. As data hosted on hadoop is generally huge, importing the source data into a new table in MS Access is not possible: the maximum size of an MS Access database is just 2 GB. Linking is the only option. Hopefully data extracted from queries will be of much smaller size and can be accommodated in an MS Access database. Click OK. In the Select Data Source dialog box, click the ‘Machine Data Source‘ tab and select the data source you just created above. Click OK.

Figure 2: Select the data source you just created above and click OK.

You will be presented with all the tables from Impala. Select as many as you want. MS Access will show them as linked tables (see figure below).

Figure 3: Walmart transaction table from Impala. Click to enlarge image.

You can now query Impala from MS Access as usual. You should also be able to create a .Net program from MS Access as usual.

 

Cloudera machine, increasing datanode capacity and also resizing logical volumes

June 21, 2014

On a single machine with Cloudera in pseudo-distributed mode, while experimenting one often feels the need a) to increase datanode capacity and b) to increase the size of a partition, as hadoop is slow to release the space of deleted files.

On my machine, Cloudera has used up most of / (root) partition. There was a need to increase both the space for datanode and also size of /  (root) partition. I had sufficient space on /home partition. My partitions were logical volumes so lvm utilities were used to increase/decrease partition size. This is how I proceeded.

a. Increasing datanode space.
As root, create a folder; I created /home/hdfs/dfs. Use the chown command to change the ownership of the hdfs and dfs folders to the superuser ‘hdfs’, as:

#  chown -R hdfs:hdfs /home/hdfs

Login to Cloudera manager as ‘admin’. Click on ‘hdfs–>Configuration–>View and Edit‘. In the search box type ‘dfs.datanode.data.dir’. By default, /dfs/dn is the hadoop storage directory. Hit the + sign, next to it, a text-box opens. Write there the name of just created folder, /home/hdfs/dfs, and Save Changes. This part of the task is now finished.

b. Reducing  /home partition

First, make yourself familiar with the names of your logical volumes and unused space available. This is my machine’s status:

# df -h
Filesystem     Size     Used    Avail   Use%   Mounted on
/dev/mapper/VolGroup-lv_root
               50G      45G      3.9G   93%    /
tmpfs          7.8G     76K      7.8G   1%     /dev/shm
/dev/sda1      485M     98M      363M   22%    /boot
/dev/mapper/VolGroup-lv_home
               402G     35G      347G   10%    /home
cm_processes   7.8G     0        7.8G   0%     /var/run/cloudera-scm-agent/process

My / (root) partition is named lv_root. It is terribly short of space, with only 3.9G available. On the other hand, the /home partition, named lv_home, has 347G available. Both are on the volume group VolGroup. We will transfer 100G from /home to /. For a clearer picture, you can issue the lvs command.

# lvs
LV            VG        Attr         LSize
lv_home    VolGroup    -wi-ao----    407.50g
lv_root    VolGroup    -wi-ao----    50.00g
lv_swap    VolGroup    -wi-ao----    7.77g
(We have removed some columns with blank values)

This information is the same as above but more concise.

Now, proceed as follows: Use Cloudera Manager to stop the cluster, i.e. all services within the cluster. Next, as root, issue the following two commands so that cloudera server does not restart on reboot:

# chkconfig cloudera-scm-server off
# chkconfig cloudera-scm-server-db off

As root, edit, /etc/inittab file so that on next reboot, the system restarts in Level 3 (text) run mode.

Look for the following line:
id:5:initdefault:
And replace 5 with 3 as,
id:3:initdefault:

Reboot your system. It will start in text mode. Log in as root and issue the following three commands (umount, e2fsck and resize2fs), one by one. The order of the commands is very important:


Unmount home partition
# umount /home
Check lv_home filesystem for any errors
# e2fsck -f /dev/VolGroup/lv_home
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/lv_home: 10530/26705920 files (0.3% non-contiguous), 10687950/106822656 blocks

Resize lv_home filesystem from 407G to 300G
# resize2fs /dev/VolGroup/lv_home 300G
resize2fs 1.41.12 (17-May-2010)
Resizing the filesystem on /dev/VolGroup/lv_home to 78643200 (4k) blocks.
The filesystem on /dev/VolGroup/lv_home is now 78643200 blocks long.

Reduce the size of the lv_home logical volume. This command must follow the resize2fs command. Note that while resize2fs is a filesystem (ext2, ext3, ext4) resizing utility, lvreduce is a partition resizing utility. A filesystem comprises data and data structures. A disk partition, on the other hand, divides a hard disk into logical structures taking into account the disk geometry; one partition can be treated as an independent hard drive, and data regarding this structure is kept in the master boot record. In our case, since we are reducing the partition size, the filesystem size must be reduced first, followed by the partition size reduction.


# lvreduce  -L -100G  /dev/VolGroup/lv_home
WARNING: Reducing active logical volume to 307.50 GiB
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce lv_home? [y/n]: y
Reducing logical volume lv_home to 307.50 GiB
Logical volume lv_home successfully resized

Remount home partition, and check.
# mount /home
# ls /home
ashokharnal  dfs  hdfs  lost+found

c. Increasing / (root) partition

We will now extend the root (/) partition using the lvextend command. In this case, we have to first increase the partition size and thereafter increase the filesystem size to fill what is available. Also, the root (/) partition cannot be unmounted; we have to work on the mounted partition.


Use lvextend command to extend logical volume (partition).
# lvextend  -L  +100G  /dev/VolGroup/lv_root
Extending logical volume lv_root to 150.00 GiB
Logical volume lv_root successfully resized

Next, use the resize2fs command to resize (grow) the / filesystem, as below. We do not specify the size parameter; the default is the full size of the partition.


# resize2fs /dev/VolGroup/lv_root
resize2fs 1.41.12 (17-May-2010)
Filesystem at /dev/VolGroup/lv_root is mounted on /; on-line resizing required
old desc_blocks = 4, new_desc_blocks = 10
Performing an on-line resize of /dev/VolGroup/lv_root to 39321600 (4k) blocks.
The filesystem on /dev/VolGroup/lv_root is now 39321600 blocks long.

Check the filesystem space,

# df -h
Filesystem            Size      Used   Avail    Use%    Mounted on
/dev/mapper/VolGroup-lv_root
                      148G      45G    102G     31%        /
tmpfs                 7.8G      224K   7.8G     1%         /dev/shm
/dev/sda1             485M      98M    363M     22%        /boot
/dev/mapper/VolGroup-lv_home
                      296G      35G    246G    13%        /home
cm_processes          7.8G     0       7.8G     0%        /var/run/cloudera-scm-agent/process

We are finished. Edit the /etc/inittab file and change ‘3’ back to ‘5’. To restart Cloudera Manager on reboot, issue the chkconfig commands as:


# chkconfig cloudera-scm-server on
# chkconfig cloudera-scm-server-db on

Reboot your machine, restart the hadoop cluster through Cloudera Manager and enjoy your work!

Setting up NetBeans for Mahout on Cloudera (CentOS system)

June 16, 2014

This is a simple way to set up NetBeans to write Java programs for Apache Mahout. We will not use maven. On my machine Cloudera 5.x is installed through parcels. But steps are the same if you have another version of Cloudera installed with packages/parcels. These steps will also work if you have downloaded and installed mahout distribution (say, mahout-distribution-0.9.tar.gz) directly from Apache Mahout download site instead of installing it from cloudera repository.

1. Download and install NetBeans from here. Download full version, if you are in doubt. You need to have jdk 7 already installed. Else, you can download a composite package of jdk 7 with NetBeans from here and install both together.  You can install NetBeans as a local user or as root; it is up to you. If you install as root, it will be made available to all users.

2. In your machine, locate mahout jar files. Use ‘locate’ command for the purpose, as below. In our case, the answer to locate command shows that mahout jar files are in the folder  /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout. (If you have installed mahout directly by expanding mahout distribution then location of mahout libraries will be under the expanded tar/gz folder.)

[ashokharnal@master ~]$ locate -r mahout-core-.*job.jar
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout/mahout-core-0.8-cdh5.0.0-job.jar
[ashokharnal@master ~]$

3. Start NetBeans (Applications–>Programming–>NetBeans IDE) and on the NetBeans menu, click Tools–>Libraries.  Click on the button ‘New Library’ to add a new library.

Figure 1: ‘New Library’ dialogue box of NetBeans. Name of library is unimportant. Click image to enlarge.

This will create an empty library. To add jar files, select ‘mahout’ and click the button Add Jar/Folder. You will have to pick up jar files from three folders, one inside the other. First, navigate to the folder where the jar file mahout-core-0.8*job.jar is located. This folder is ‘mahout’. On my machine this folder appears as below:

Figure 2: ‘mahout’ folder with jar files on my machine with Cloudera 5.x. Note also the ‘lib’ folder. lib folder contains plenty of jar files. And there is also a folder ‘hadoop’ under lib folder.  It also contains jar file(s). Click image to enlarge.

Add all jar files from this folder to just created ‘netbeans mahout library’. There will be (around) six jar files.

Figure 3: Selecting six jar files from under the mahout folder for adding to the ‘netbeans mahout library’. Click image to enlarge.

The added jar files will appear as in Figure 4.

Figure 4: Six jar files added to NetBeans newly created mahout library. Click image to enlarge.

Click again Add Jar/Folder button, and open the ‘lib’ folder under the mahout folder from where you just picked up the six mahout jar files. See Figure 2 above.  ‘lib’ folder will also contain one ‘hadoop’ folder and plenty of jar files. Add all jar files to netbeans mahout library. Next, add the jar file(s) available under ‘hadoop’ folder to this library (we add all jar files just to be on the safe side).

Figure 5: All jar files added to the newly created NetBeans library. Click image to enlarge.

Click OK button and close the Ant Library Manager. We are now ready to write a mahout program. In NetBeans, click File–>New Project and in the dialogue box, select Java->Java Application as below. Click Next.

Figure 6: Create a Java Application Project. Click Next. Click image to enlarge.

In the next window, name your Java project. We will call it RecommenderIntro. The name of the Java class will, by default, also be the name of the project. Click the Finish button.

Figure 7: Creating RecommenderIntro project. Click image  to enlarge.

In the RecommenderIntro.java tab, replace all its contents with the following code (copy and paste). The following program has been taken from the Manning book: Mahout in Action.


import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderIntro
    {
    public static void main(String[] args) throws Exception
        {
        // Load user,item,rating triples from a CSV file on the local file system
        DataModel model = new FileDataModel( new File( "/home/ashokharnal/Downloads/input.csv" ) );
        // Pearson correlation as the user-user similarity measure
        UserSimilarity similarity = new PearsonCorrelationSimilarity( model );
        // Neighborhood of the 2 most similar users
        UserNeighborhood neighborhood = new NearestNUserNeighborhood( 2, similarity, model );
        Recommender recommender = new GenericUserBasedRecommender( model, neighborhood, similarity );
        // Recommend 1 item for the user with id 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 1);
        for (RecommendedItem recommendation : recommendations)
            {
            System.out.println(recommendation);
            }
        }
    }

In the FileDataModel constructor above, the data file ‘input.csv’ must be specified with its full path. The file is on the local file system, not hadoop. The data in ‘input.csv’ is as below. You can copy and paste it into your text editor to create the data file.

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

We need to add netbeans mahout library to this project. For that purpose, under Projects, expand RecommenderIntro, right-click on Libraries and click on Add Library. From the Add Library dialogue box select mahout and we are ready to compile and run the project (see figure 8 below).

Figure 8: Adding netbeans mahout library to our project. Click image to enlarge.

Click Run–>Clean and Build the Project and then press F6 to run the project. You should get the following result:
RecommendedItem[item:104, value:4.257081]
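The arithmetic behind this recommendation can be reproduced with a short Python sketch; this mirrors (but is not) Mahout's internal code: Pearson similarity over co-rated items, a neighborhood of the 2 most similar users, and a similarity-weighted average of the neighbors' ratings:

```python
import math

# The input.csv data above as user -> {item: rating}
ratings = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}

def pearson(u, v):
    """Pearson correlation over items co-rated by users u and v."""
    common = [i for i in ratings[u] if i in ratings[v]]
    if len(common) < 2:
        return None  # undefined with fewer than two common items
    mu = sum(ratings[u][i] for i in common) / len(common)
    mv = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    den = math.sqrt(sum((ratings[u][i] - mu) ** 2 for i in common)
                    * sum((ratings[v][i] - mv) ** 2 for i in common))
    return num / den if den else None

# The two nearest neighbors of user 1 by Pearson similarity
sims = {v: pearson(1, v) for v in ratings if v != 1}
neighbors = sorted((v for v in sims if sims[v] is not None),
                   key=lambda v: sims[v], reverse=True)[:2]

# Estimate user 1's preference for item 104 as the similarity-weighted
# average of the neighbors' ratings for it
num = sum(sims[v] * ratings[v][104] for v in neighbors if 104 in ratings[v])
den = sum(sims[v] for v in neighbors if 104 in ratings[v])
print(round(num / den, 6))  # → 4.257081
```

With this data, users 4 and 5 form user 1's neighborhood, and item 104 comes out at about 4.257081, matching the output above.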

Recommender System using RapidMiner

April 28, 2014

Recommender systems can be broadly classified into content-based recommender systems and recommender systems using Collaborative Filtering (CF). Collaborative Filtering based recommender systems can be further classified as memory-based CF and model-based CF. Memory-based Collaborative Filtering systems can be sub-classified as user-based CF and item-based CF, and within each there are further sub-classifications depending upon the similarity measure selected, Pearson or cosine. In both user-based and item-based CF, the attempt is to predict a user's ratings for items that he has not rated and then suggest to him the item(s) with the highest predicted rating(s).

Data for Collaborative filtering

Plenty of public data sets for testing recommender engines are available. A list is available at this site. We will use the zipped data set (ml-1m) from the GroupLens Research Project (http://www.grouplens.org/). This file contains 1,000,209 movie ratings for 3952 movies made by 6040 users. Ratings are on a scale of 1 to 5 (whole-star ratings only). There are three files: ‘ratings.dat’, ‘users.dat’ and ‘movies.dat’. File ‘ratings.dat’ contains the user ratings in the format: UserID::MovieID::Rating::Timestamp. Each user has at least 20 ratings. Since there are 3952 movies (far more than the roughly 20 ratings a typical user provides), the rating matrix is necessarily sparse.

User information is in file ‘users.dat’ and is in the format: UserID::Gender::Age::Occupation::Zipcode. Field information pertaining to ‘users.dat’ file is given in README file. File, movies.dat, contains movie information in the format: MovieID::Title::Genres. ‘Titles’ are identical to titles provided by the IMDB (including year of release) and ‘Genres’ are pipe-separated and are selected from the following genres: Action, Adventure, Animation, Children’s, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western.

For prediction of user ratings using collaborative filtering we need data with three fields: userid, movieid and ratings. This data is available in file ‘ratings.dat’. For the purposes of Collaborative Filtering experiments we do not need ‘Timestamp’ column.

RapidMiner extension for Recommender engine

The RapidMiner extension for recommender engines has been created by e-LICO, an e-Laboratory for Interdisciplinary Collaborative Research in Data Mining and Data-Intensive Science. It can be downloaded from here after registering at RapidMiner Marketplace. It is available under the AGPL license. Download three jar files: rmx_irbrecommender-ANY-5.1.1, rmx_community-ANY-5.3.0, and rmx_dmassist-ANY-5.1.2. Place these jar files under the folder rapidminer/lib/plugins/ and restart RapidMiner. The extensions do work with version 5.3.015 of RapidMiner Studio. Check for the following three types of operators in RapidMiner by searching for ‘recommender’.

Figure 1: Three broad types of recommender systems installed in RapidMiner

We will be using operators pertaining to Collaborative Filtering under  ‘Item Rating Prediction’.

We need to convert the ‘ratings.dat’ file into an appropriate data format. In Windows, open this file in Excel (in the Text Import Wizard, against ‘Other’, write ‘:’ and check-mark ‘Treat consecutive delimiters as one’). After importing, delete the ‘Timestamp’ column and save the file back in CSV (comma separated) format. Open this csv file in WordPad and replace all commas in the file with a single space. Save the file as ‘rev_ratings.dat’ (again as a plain text document), so that all three numbers on a line are separated by a single space.

In Linux, just run the following single line command in shell:

awk 'BEGIN {FS="::"} ; {print $1 " " $2 " " $3}' ratings.dat > rev_ratings.dat

The file rev_ratings.dat contains all of the one million records. We need to take out a sample of records from this file to test the model. We use the following script to create a sample file whose lines are picked randomly (without replacement) from ‘rev_ratings.dat’. The script creates two files: ‘training.dat’ and ‘tovalidate.dat’.

#!/bin/bash
#
# Generates test-sample file to test recommender engine.
# The script randomly picks up n-lines (n to be specified) from the
# given file. It creates two files: training.dat and tovalidate.dat
# The original file remains undisturbed.
#
# Get your home folder
cd ~
homefolder=`pwd`
# Your data folder & data file
datadir="$homefolder/Documents/data_analysis/rapidminer"
originalfile="rev_ratings.dat"
samplefile="tovalidate.dat"

cd $datadir
# Delete earlier sample file & recreate one of zero byte
rm -f $datadir/$samplefile
touch $datadir/$samplefile
# Delete training file and recreate one of zero byte
rm -f $datadir/training.dat
cp $datadir/$originalfile  $datadir/training.dat
# Delete temp file, if it exists
rm -f temp.txt
# Get number of lines in given file
nooflines=`sed -n '$=' $datadir/$originalfile`

echo "No of lines in $datadir/training.dat  are: $nooflines"
echo -n "Your sample size (recommended 10% of orig file)? " ; read samplesize

# Default is 10
if [ -z "$samplesize" ] ; then
    echo "Default value of sample size = 10"
    samplesize=10
fi

# Bash loop to pick random line numbers
echo "Will extract $samplesize randomly chosen lines"
echo "Original file size is $nooflines lines"
echo "Wait....."
for (( i = 1 ; i <= $samplesize ; ++i ));
    do
        # training.dat shrinks by one line each iteration; recompute its length
        curlines=`sed -n '$=' $datadir/training.dat`
        # RANDOM spans only 0-32767, so combine two draws to cover large files
        lineno=$(( ( (RANDOM * 32768 + RANDOM) % curlines ) + 1 ))
        # Extract and append line to sample file
        sed -n "${lineno}p" $datadir/training.dat >> $datadir/$samplefile
        # Delete the same line from training.dat
        sed "${lineno}d" $datadir/training.dat > temp.txt
        mv temp.txt $datadir/training.dat
    done
trlines=`sed -n '$=' $datadir/training.dat`
samplines=`sed -n '$=' $datadir/$samplefile`

# Delete temp file
rm -f temp.txt

echo "---------------------------------"    
echo "Lines in sample file $samplefile: $samplines"
echo "Lines in training file training.dat : $trlines" ;
echo "Data folder: $datadir"
echo "---------------------------------"
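The same split can also be sketched compactly in Python; this is an illustrative equivalent (file names as in the script above), not the script the post actually uses:

```python
import random

def split_file(original, training, tovalidate, sample_size, seed=None):
    """Randomly route sample_size lines of `original` into `tovalidate`
    and the remaining lines into `training`, without replacement.
    The original file is left undisturbed."""
    rng = random.Random(seed)
    with open(original) as f:
        lines = f.readlines()
    picked = set(rng.sample(range(len(lines)), sample_size))
    with open(tovalidate, "w") as val, open(training, "w") as train:
        for i, line in enumerate(lines):
            (val if i in picked else train).write(line)

# Example: split_file("rev_ratings.dat", "training.dat", "tovalidate.dat", 100000)
```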

Creating and reading Attribute Description file (AML files)

We will now create two attribute description files to describe the attributes of our two ‘.dat’ files. An attribute description file has the extension .aml and is in a standard XML format, given below. The outermost tag must be the <attributeset> tag. The immediate inner tag is the <attribute> tag; there can be any number of <attribute> tags. There can also be a <label> tag instead of an <attribute> tag if the role of that attribute is that of class-label. For more details on the .aml file format see this link. Our two attribute description files, movie_train.aml and movie_test.aml, are as below:

<?xml version="1.0" encoding="windows-1252"?>
<attributeset default_source="training.dat">
    <attribute
        name = "user_id"
        sourcecol = "1"
        valuetype = "integer"/>
    <attribute
        name = "item_id"
        sourcecol = "2"
        valuetype = "integer"/>
    <attribute
        name = "rating"
        sourcecol = "3"
        valuetype = "real"/>
</attributeset>

The second AML file that refers to data in tovalidate.dat file is:

<?xml version="1.0" encoding="windows-1252"?>
<attributeset default_source="tovalidate.dat">
    <attribute
        name = "user_id"
        sourcecol = "1"
        valuetype = "integer"/>
    <attribute
        name = "item_id"
        sourcecol = "2"
        valuetype = "integer"/>
    <attribute
        name = "rating"
        sourcecol = "3"
        valuetype = "real"/>
</attributeset>

And a sample of data in training.dat file is as below:

123 2011 4
42 377 4
163 3268 1
136 1225 4
187 2610 5
175 2590 5
189 1717 4
173 2323 5

Construct Workflow in RapidMiner
In RapidMiner, drag the Read AML operator twice into the workspace. Then, in the workspace, click on a Read AML operator and, in the Parameters panel, click the grid icon against the attributes parameter (see the red arrow in the figure).

Figure 2: Reading attribute description files. Click on the grid icon next to attributes parameter. Click to see larger image.

This will open up the Attribute Editor window. Again click on the icon with the + (plus) sign (see the red line in the figure below) and browse to the ‘movie_train.aml’ file in your file-system. The data from the file ‘training.dat’ is pulled up and shown in the Attribute window. Close the Attribute Editor. Similarly, use the second Read AML operator to open ‘movie_test.aml’ and the associated tovalidate.dat file. You can connect the output (out) ports of both Read AML operators to the results (res) ports at the right wall of the workspace. Click the Run icon (Process–>Run, or press F11) to observe the file contents. The Read AML operator reads the data file at the time the process runs; nothing is imported beforehand into the RapidMiner repository. Thus, each time before the process is re-run, the data in the dat files can be changed.

Figure 3: Click on the grid icon with + sign (red arrow) and browse to .aml file to open it. Click to enlarge image.

We now have to set roles for the three fields. Roles will be as follows:

rating     :  label
user_id  : user identification
item_id : item identification

Note that the roles for user_id and item_id are a little unusual, but the extension requires these roles. Drag the Set Role operator twice from the Operators panel and place the two operators in the workspace as shown in the figure. Click on a Set Role operator and then click on Edit List (see figure).

Figure 4: Drag ‘Set Role’ from operators window into workspace. Click on ‘Edit List’ to set additional roles. Click to enlarge image.

The Edit Parameter List window is shown in the following figure. For typing out roles ‘user identification’ and ‘item identification’ each key has to be pressed twice otherwise it does not get typed properly. Click Apply to return back to Workspace.  Do this for the other Set Role operator also.

Figure 5: Assigning roles to userid (as ‘user identification’), itemid (as ‘item identification’) and ratings (as ‘label’). Click to enlarge image.

Drag to workspace, operator Item k-NN, from under Recommender Systems–>Item Rating Prediction.  Drag also Apply Model (Rating Prediction)  and Performance (Rating Prediction) from Operators window. Make workflow connections as shown below.

Figure 6: The complete model with Item k-NN as model builder. We have here the default k=80. Click to enlarge image.

Run the model. Performance in terms of RMSE, MAE and Normalized Mean Absolute Error (NMAE) is shown.

Figure 7: Performance of the model.
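These metrics follow the standard definitions; as a reference, here is a small Python sketch (NMAE here normalizes MAE by the rating range, which is one common convention):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def nmae(actual, predicted, r_min=1, r_max=5):
    """MAE normalized by the rating range (1-5 for MovieLens)."""
    return mae(actual, predicted) / (r_max - r_min)

# Toy illustration with five held-out ratings and their predictions
actual    = [5, 3, 4, 4, 2]
predicted = [4.5, 3.5, 4, 3, 2.5]
print(rmse(actual, predicted), mae(actual, predicted), nmae(actual, predicted))
```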

Finding user ratings for un-rated movies

If you are satisfied with the model, you can proceed to run with the real data, i.e. data for the user whose ratings for some movies we want to predict. We will edit our ratings.dat and tovalidate.dat to do this. Our tovalidate.dat file now contains user ratings for user 1 only (records for other users are deleted); it no longer contains the randomly selected records. And for this user, for the movies with IDs 150, 1, 1961, 1962, 260, 1029, 1207, 2028, 531, 3114, 608 and 1246, we delete (only) the ratings, replacing them with blanks. Our tovalidate.dat file therefore contains just the following data; the last twelve records have no ratings.

1 1193 5
1 661 3
1 3408 4
1 2355 5
1 1197 3
1 1287 5
1 2804 5
1 919 4
1 595 5
1 938 4
1 2398 4
1 2918 4
1 2791 4
1 2687 3
1 2018 4
1 3105 5
1 2797 4
1 720 3
1 48 5
1 1097 4
1 1721 4
1 1545 4
1 745 3
1 2294 4
1 3186 4
1 588 4
1 1907 4
1 783 4
1 1836 5
1 1022 5
1 2762 4
1 150 5
1 1 5
1 1961 5
1 1962 4
1 260 4
1 1029 5
1 1207 4
1 2028 5
1 531 4
1 3114 4
1 608 4
1 1246 4
1 1193 5
1 661 3
1 3408 4
1 2355 5
1 1197 3
1 1287 5
1 2804 5
1 919 4
1 595 5
1 938 4
1 2398 4
1 2918 4
1 2791 4
1 2687 3
1 2018 4
1 3105 5
1 2797 4
1 720 3
1 48 5
1 1097 4
1 1721 4
1 1545 4
1 745 3
1 2294 4
1 3186 4
1 588 4
1 1907 4
1 783 4
1 1836 5
1 1022 5
1 2762 4
1 150
1 1
1 1961
1 1962
1 260
1 1029
1 1207
1 2028
1 531
1 3114
1 608
1 1246

We have also edited the file ratings.dat. The twelve records pertaining to user ‘1’ for the movies with IDs 150, 1, 1961, 1962, 260, 1029, 1207, 2028, 531, 3114, 608 and 1246 have been deleted, so this file contains for user 1 only those records for which his ratings are available. The data for all other users remains in this file as it was. We then create the following workflow in RapidMiner and run it.

Figure 8: Workflow to find user ratings for movies for which no ratings have been given by the user. Click to enlarge image.

The results of this workflow can be seen in Figure 9: predictions have been made for user 1 for the movies for which he had given no ratings.

Figure 9: Ratings have now been predicted for movies that user 1 had not rated (Rows 75 to 86).

Parameter optimization

The k-NN algorithm has a number of parameters, the most prominent being k. Is there a way to automatically test various values of k and pick the best one? Yes, there is. Rebuild the RapidMiner workflow as shown in the next three diagrams using the Optimize Parameters (Grid) operator. Three Optimize Parameters operators are available in RapidMiner; we will use only the Grid operator. You are free to try the other two, especially Optimize Parameters (Evolutionary).

Figure 10: Use the Optimize Parameter (Grid) operator to find out that k in Item k-NN that will improve the performance most. Click to enlarge image.

Create the workflow as shown in Figure 10 (above). Next, click the button 'Edit Parameter Settings…' (see red arrow). In the window that opens (see below), three operators appear in the Operators panel. Select the first operator, Item k-NN; all its parameters are then displayed in the middle panel, Parameters. Select k and click the forward-arrow button to move k to the Selected Parameters panel. In the Grid/Range text boxes, set the range for k (Min=1, Max=100, Steps=10) and click the OK button.
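Conceptually, the Grid operator simply loops over the candidate values of k, evaluates the model for each, and keeps the best performer. A minimal Python sketch of this idea; `evaluate()` below is a made-up stand-in for RapidMiner's train-and-validate step, not a real RapidMiner call:

```python
# Grid search: try each candidate k and keep the one with the lowest error.
def evaluate(k):
    # Hypothetical stand-in for a train-and-validate run; this toy RMSE
    # curve happens to have its minimum at k = 21.
    return abs(k - 21) / 100 + 0.9

def grid_search(candidates):
    best_k, best_rmse = None, float("inf")
    for k in candidates:
        rmse = evaluate(k)
        if rmse < best_rmse:
            best_k, best_rmse = k, rmse
    return best_k, best_rmse

# Roughly what Min=1, Max=100, Steps=10 produces: k = 1, 11, 21, ..., 91
print(grid_search(range(1, 101, 10)))   # -> (21, 0.9)
```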

Figure 11. Parameter k of Item k-NN is allowed to vary from k=1 to 100. Click to enlarge image.

Back in the workspace, double-click the lower-right overlapping-rectangles icon on the Optimize Parameters operator (see the red circle in Figure 10). It opens the optimization process, as shown in Figure 12 (below). Construct the workflow for the process to be optimized.

Figure 12: Within the Optimize Parameter operator, construct the workflow as shown here. Run the process (F11).

Run the complete process (press F11). The optimization runs and after some time gives the following results:

Figure 13: Result after process optimization. Best performance is achieved with k=21.

Ensemble of models

A number of item-rating algorithms can be connected in parallel and their outputs combined in a weighted fashion to create an ensemble recommender model. For this purpose, we use two more operators: the process operator Multiply and the model-combiner operator Model Combiner (Rating Prediction). The workflow is shown below.
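The idea behind the combiner is simple: each model produces its own predicted rating for a (user, item) pair and the final prediction is their weighted average. A minimal sketch with made-up predictions and weights, not RapidMiner output:

```python
def combine(predictions, weights):
    """Weighted average of the per-model predicted ratings."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

# Three models predict a rating for the same (user, item) pair
preds = [4.1, 3.8, 4.4]        # e.g. k-NN, matrix factorization, a baseline
weights = [0.5, 0.3, 0.2]      # heavier weight for the better model
print(round(combine(preds, weights), 2))   # -> 4.07
```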

Figure 14: Using process operator, Multiply, and operator, Model Combiner (Rating Prediction) to apply multiple models to a set of data. Click to enlarge image.

On running this ensemble process with default parameters, there is a slight improvement in the results, shown below.

Figure 15: The performance of multiple models. The result is better than a single model.

Optimizing Parameters within ensemble

As before, we can optimize parameters within the ensemble using the Optimize Parameters (Grid) operator. The workflow inside the optimization process window is shown below.

Figure 16: Optimizing parameters in an ensemble.

We will optimize the learning rate of the Matrix Factorization algorithm. The results, shown below, are a very marginal improvement over the earlier ones.

Figure 17: Optimized performance of ensemble. Optimization was done for learning rate in MF.

Simultaneous performance measurement & finding user ratings for unrated items

In RapidMiner, you can measure model performance and obtain ratings for unrated user items simultaneously with the following workflow. We have used the Multiply operator to get more than one output from the Item k-NN model: one output is used to test and measure performance, and the other to get ratings for unrated items for some user(s).

Figure 18: Measuring performance and also predicting ratings for unrated user items. Click to enlarge.

This completes our discussion of using RapidMiner as a recommender engine. I hope the article proves useful.

Analysing MovieLens movie data using Pig — A Step by Step tutorial

March 25, 2014

We will use a MovieLens dataset for analysis with Pig. The data is available from here. These datasets were collected by the GroupLens Research Project. Three datasets of varying sizes are available from the site: MovieLens 100K, MovieLens 1M and MovieLens 10M. They contain movie ratings made by movie-goers. We will use the MovieLens 1M dataset. Download and unzip it. It contains three text files: ratings.dat, users.dat and movies.dat. You can view them in gedit if you have sufficient RAM. The contents of these files are described in the README that accompanies the dataset; for the sake of completeness, they are briefly described here.

ratings.dat–>userid::movieid::rating::timestamp

userid ranges from 1 to 6040, movieid ranges from 1 to 3952, rating is made on a five-star discrete scale (whole-star rating only) and timestamp is in seconds.

users.dat–>userid::gender::age::occupation::zipcode

gender is M or F; age is categorized as 1, 18, 25, 35, 45, 50, 56. The meanings are: 1: under 18; 18: 18-24; 25: 25-34; 35: 35-44; 45: 45-49; 50: 50-55; 56: 56+.

occupation is one of these: 0: other, 1: academic/educator ; 2: artist ; 3: clerical/admin ; 4: college/grad student ; 5: customer service ; 6: doctor/health care ; 7: executive/managerial ; 8: farmer ; 9: homemaker ; 10: K-12 student ; 11: lawyer ; 12: programmer ; 13: retired ; 14: sales/marketing ; 15: scientist ; 16: self-employed ; 17: technician/engineer ; 18: tradesman/craftsman ; 19: unemployed ; 20: writer

movies.dat–>movieID::title::genres

Titles are movie titles and genres are pipe-separated and are selected from the following genres: Action, Adventure, Animation, Children’s, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

Field separators are double colons; we substitute them with semicolons and then copy the whole directory to the hadoop folder. The following shell script does this preparatory work. Our machine has Cloudera installed.

#!/bin/bash
#
#
cd /home/ashokharnal
unzip ml-1m.zip
mv ml-1m movie
# The .dat files are inside the movie folder
cd movie
# Unfortunately PigStorage does not work with :: as separator
# so replace :: with ;
sed 's/::/;/g' users.dat > users.txt
sed 's/::/;/g' movies.dat > movies.txt
sed 's/::/;/g' ratings.dat > ratings.txt
mv users.txt users.dat
mv movies.txt movies.dat
mv ratings.txt ratings.dat
cd ..

# Copy files to hadoop system
# Delete any earlier movie folder on hadoop
hdfs dfs -rmr /user/ashokharnal/movie
# Copy movie folder to hdfs
hdfs dfs -put movie /user/ashokharnal

Pig can be started from the logged-in user's command shell as below. If you are on a Cloudera system, you may have to export the classpath of hadoop libraries. After a few startup messages, the pig shell prompt, grunt>, appears. We will execute pig commands in this shell.

$ export HADOOP_CLASSPATH="/usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*:$HADOOP_CLASSPATH"
$ pig
grunt> 

Using the load statement, load the users.dat file into pig. In fact, nothing is loaded immediately; only the syntax is checked. Pig does not even verify at this stage that the file exists.

/* Load users.dat file. 'user' on the left is not a variable but an alias
for the load command. While loading, we also: a. indicate the field
separator, b. specify proposed field names, and c. specify their data types.
The default field separator is tab. */

user = load '/user/ashokharnal/movie/users.dat' using PigStorage(';') as (userid:int, gender:chararray, age:int, occupation:int, zip:chararray) ;

/* for every row in the user, generate columnar data as follows.
This statement, in effect, projects named columns */
udata = foreach user generate userid , gender , age ,occupation , zip ;

/* Dump output to screen but limit it to five rows */
lt = limit udata 5 ;
dump lt ;

In Pig, function names and aliases are case sensitive; Pig Latin keywords are not.
While executing the load, foreach and limit statements, pig merely checks the syntax. Only when a dump (or store) statement is encountered are these statements fully executed and output produced.
Let us now get the distribution of gender in the data. We will first group by gender using a group statement and then run a foreach statement (followed by dump) to see some columns. (Note: in pig a comment can be written C-style or after two dashes (--) on a line.)

gr = group user by gender ;                      --group alias 'user' by gender
co = foreach gr generate group, COUNT(user) ;    --for each line in the group (i.e. M and F)
                                                 -- count rows in 'user'
co = foreach gr generate group, COUNT_STAR(user) ; --same as above
dump co ;

Let us now count how many persons are below 35 years of age, i.e. the number of young users.

/* First write a filter statement */
young = filter user by age < 35 ; 
cm = group young all ;        
countyoung = foreach cm generate COUNT(young) ;
dump countyoung ;

In the above code you may find the group statement a little unusual. 'group young all' puts all records into a single group, which is needed because COUNT operates group-wise; a FOREACH statement will not work directly on the filter alias young (i.e. foreach young…). Next, count how many female users between the age categories 35 and 50 have rated movies. We will use two filters, one after another.

/* How many are between 35 and 50 */
m_age = filter user by age >=35 AND age <= 50 ;
/* How many of these are females. Filter after filter */
m_age_female = filter m_age by gender == 'F' ; 
/* First create group to project a column */
cmg = group m_age_female by gender ;  -- group by gender, OR
cmg = group m_age_female ALL ;        -- group by ALL fields

/* Note that the above group statement is NOT 'group m_age_female by ALL'.
   Create a foreach statement and dump it */
count_m_female = foreach cmg generate COUNT(m_age_female) ;
dump count_m_female ;

We can order our filtered data by age using the order statement.

ord = order m_age_female by age ;

Next, we split the data in various ways. We will first split it gender-wise and then age-wise.

/* Split udata by gender into two aliases */
split udata into female if gender == 'F', male if gender == 'M' ;
dump female ;   --see female data only
/* split data into young and middle aged females */
split udata into yfemale if gender == 'F' AND age <= 35, mfemale if ( gender == 'F' AND (age > 35 AND age <= 50 ));

dump yfemale;   -- see young females data

Sample 1% of the users data for experimentation. The sample is random; no complementary file is created containing the records excluded from the sample.

/* Sample 1% (0.01) of data */
sam = sample udata 0.01 ;
dump sam;

Let us now load ‘ratings’ data and generate columns.

/* Load ratings file in pig. Also specify column separator, column names and their data types. */
ratings = load '/user/ashokharnal/movie/ratings.dat' using PigStorage(';') as (userid:int, movieid:int, ratings:int, timestamp:int) ;
/* A foreach statement to generate columns. You can dump the alias rdata */
rdata = foreach ratings generate (userid,movieid,ratings,timestamp) ;  --dump will have extra brackets (( ))
rdata = foreach ratings generate userid,movieid,ratings,timestamp ;    --dump will have one pair of brackets

--Check 
x = limit rdata 5 ;
dump x ;

We will now join the two relations ‘user’ and ‘ratings’ on userid. This is an inner join. We will use the join to calculate average ratings given by females and males. Are females more generous?

--Join two relations on userid.
j = join user by userid, ratings by userid ;
/* Group join by gender */
h = group j by gender;
/* Write a foreach to work group wise.
'group' stands for group-field with two values (M & F)
Note also that ratings field within AVG needs to be qualified by alias */
g = foreach h generate group, AVG(j.ratings) ;
/* The following will also produce output but massive repetition of M,M..*/
g = foreach h generate j.gender , AVG(j.ratings) ;
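Outside Pig, the same join-group-average pipeline can be sketched in plain Python; the rows below are made up for illustration, not the MovieLens data:

```python
# Join ratings to users on userid, group by gender, average the ratings.
users = {1: 'F', 2: 'M', 3: 'F'}                       # userid -> gender
ratings = [(1, 101, 4), (1, 102, 5), (2, 101, 3),      # (userid, movieid, rating)
           (3, 103, 4), (2, 104, 4)]

sums, counts = {}, {}
for userid, movieid, rating in ratings:   # the "join": look up each user's gender
    gender = users[userid]
    sums[gender] = sums.get(gender, 0) + rating
    counts[gender] = counts.get(gender, 0) + 1

averages = {g: round(sums[g] / counts[g], 2) for g in sums}
print(averages)   # -> {'F': 4.33, 'M': 3.5}
```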

The result is: (F,3.6203660120110372) (M,3.5688785290984373). You can draw your own conclusions. Next let us load movies.dat file and dump a few lines for our inspection.

-- Load movies.dat 
movie = load '/user/ashokharnal/movie/movies.dat' using PigStorage(';') as (movieid:int, title:chararray, genres:chararray) ;
movielist = foreach movie generate movieid, title, genres ;
x = limit movielist 3 ;
dump x;

We will now join all three tables by joining 'movie' with j, the earlier join of the two relations. As this join is complicated, we can describe it to see how it appears.

-- Join the three tables
jo = join  j by movieid, movie by movieid ;
describe jo ;
jo: {movie::movieid: int,movie::title: chararray,movie::genres: chararray,j::user::userid: int,j::user::gender: chararray,j::user::age: int,j::user::occupation: int,j::user::zip: chararray,j::ratings::userid: int,j::ratings::movieid: int,j::ratings::ratings: int,j::ratings::timestamp: int}

y = limit jo 3 ;
dump y ;
/* Project a few fields of the triple join. Note how field names are accessed. */
ghost = foreach jo generate j::user::userid  , j::ratings::ratings as rate , movie::genres ;
tt = limit ghost 5 ;
dump tt ;

Let us create a filter on this triple join. We are interested only in 'Comedy' movies; a record is relevant only if the word 'Comedy' appears in its genres string.

-- Filter out only Comedy from genres. Note carefully the syntax
filt = filter ghost by ( movie::genres matches '.*Comedy.*' ) ;
relx = group filt by movie::genres ;
dump relx ;

As field-naming in the triple join becomes a little complex, it is better to store the result in hadoop and reload the stored file into pig for further manipulation. We will do this now.

/* Store jo and reload it. This simplifies query. For storing,
in hadoop, only folder name is to be given. Pig will create the folder. */

store jo into '/user/ashokharnal/temp/myfolder' ; 

-- Reload the stored file within 'myfolder'. As userid appears twice, name it userid1 and userid2.
-- Do the same for other fields

l_again = load '/user/ashokharnal/temp/myfolder/part-r-00000' as (userid1:int, gender:chararray, age:int, occupation:int, zip:chararray, userid2:int, movieid1:int, ratings:int, timestamp:int, movieid2:int, title:chararray, genres:chararray) ;

--Check results
z = limit l_again 10 ;
dump z ;

--Create a projection of a few columns for our manipulation. We will use the alias 'alltable'
alltable = foreach l_again generate gender, genres, ratings ;

Let us now see whether there is any preference for comedy movies across gender.

--First filter out Comedy movies from triple join.
filt = filter alltable by ( genres matches '.*Comedy.*' )  ;

-- Group by gender 
mygr = group filt by gender ;

-- Find how gender votes for comedy movies
-- Note AVG argument should be as: filt.ratings. Just ratings raises an error.
vote = foreach mygr generate group, AVG(filt.ratings) ;
dump vote ;

The result is: (F,3.5719375512875113) (M,3.503666796000138)
Well, both are about equal.

Happy pigging…

A very simple explanation for AUC or Area Under Curve (ROC Curve)

March 14, 2014

In machine-learning classification models, one common measure of model accuracy is AUC, or Area Under the Curve. The curve implied is the ROC curve. ROC stands for Receiver Operating Characteristic, a term used by radar engineers in World War II.

While the curve name ROC has stayed, in machine-learning it has nothing to do with radar signals or any signal receiver. It may, therefore, be better to keep the abbreviated name ROC as it is without attempting to expand it. In our explanation of AUC we will keep the technical terms to the minimum.

In an ROC curve, we plot the true positive rate on the Y-axis and the false positive rate on the X-axis. This said, let us explain both these terms. We have data with 200 records. Assume that in this data, 100 records are classified as 1s and the other 100 records as 0s. A predictive classification model is built. Suppose that of the 100 records actually classified as 1s, the model predicts 70 as 1s and the remaining 30 as 0s. The true positive rate (TPR) is thus 70/100, or 0.70.
Let us now say that out of the 100 records actually classified as 0s, the model predicts that 70 are 1s and 30 are 0s. The false positive rate (FPR) is therefore 70%, or 0.70. What happens when the FPR of 0.70 equals the TPR of 0.70? As per the TPR we expect 70 ones from among the hundred records actually marked as 1, and as per the FPR another 70 ones from among the 100 records actually marked as 0; that is, a total of 140 ones are predicted, of which 50% are correct and 50% incorrect. Now a new record is tested and it tests as 1. There is a 50% chance that it is one of the true positives and a 50% chance that it is one of the false positives.
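Both rates come straight from the four confusion-matrix counts. A small sketch using the counts of the example above:

```python
def rates(tp, fn, fp, tn):
    """True positive rate and false positive rate from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # fraction of actual 1s that are predicted 1
    fpr = fp / (fp + tn)   # fraction of actual 0s that are predicted 1
    return tpr, fpr

# 100 actual 1s, 70 predicted as 1; 100 actual 0s, 70 predicted as 1
tpr, fpr = rates(tp=70, fn=30, fp=70, tn=30)
print(tpr, fpr)   # -> 0.7 0.7
```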

What if the TPR were 40% and the FPR also 40%? Again there is the same 50% chance of a predicted one (1) being from either of the two groups. Thus, if FPR is plotted on the X-axis and TPR on the Y-axis, all along the diagonal we have points where TPR and FPR are equal. Points along the diagonal represent a situation akin to making predictions by tossing an unbiased coin; that is, the classification model is random.

Now let us say the TPR is 70% and the FPR is 40%. A one (1) is predicted for a new record. There is a 70/(70+40) = 70/110 chance of it being from the true-positive group and a 40/110 chance of it being from the false-positive group. Our classification model is now better than tossing a coin. This point, (FPR, TPR) = (0.4, 0.7), lies above the diagonal. And this model is still better than one with the FPR/TPR combination (0.4, 0.6). That is, the farther a point lies vertically above the diagonal, the better the classification model.

Consider now the opposite: an FPR/TPR point of (0.7, 0.4). A one (1) is predicted. There is more chance (70/110) that it belongs to the false-positive group than to the true-positive group (40/110). This is worse than a model on the diagonal with its 50:50 chance. The point (0.7, 0.4) lies below the diagonal.

ROC curves. The green curve has better predictability than the red curve.


All models with points below the diagonal have worse performance than a model which makes predictions randomly.

What if our initial set contained 100 ones (1s) and 50 zeros (0s) instead of 100 of each? What would the diagonal represent? It would still consist of points with equal TPR and FPR, but all along it the probability of a predicted one coming from the true-positive group would be 2/3 and from the false-positive group 1/3; this can again be simulated by a random process, say a biased coin.
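The 2/3 and 1/3 figures follow directly from the class sizes: with TPR = FPR = r on the diagonal, the expected true positives are 100r and the expected false positives 50r, whatever r is. A quick check:

```python
ones, zeros = 100, 50
r = 0.6                        # any common TPR = FPR value on the diagonal
tp, fp = ones * r, zeros * r   # expected true and false positives
p_true = tp / (tp + fp)        # chance a predicted 1 is a true positive
print(p_true)                  # -> 0.666... (= 2/3), independent of r
```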

How an ROC curve is drawn: suppose you have built a classification model with a TPR of 0.70 and an FPR of 0.30, so it stands well above the diagonal. All binary classification models judge a record to be in one class or the other based on a threshold value of the model output. Suppose you now decrease this threshold. Some records that were earlier classified as zeros will be reclassified as ones; the total number of predicted 1s can only increase, and the number of predicted 0s can only decrease. Hence both TPR and FPR will increase (or stay the same). See the following table.

Actual class   Model output   Predicted class (cut-off 0.5)   Predicted class (cut-off 0.45)
1              0.6            1                               1
1              0.5            1                               1
1              0.5            1                               1
1              0.45           0                               1
1              0.44           0                               0
0              0.5            1                               1
0              0.5            1                               1
0              0.45           0                               1
0              0.45           0                               1
0              0.4            0                               0
                              TPR=3/5, FPR=2/5                TPR=4/5, FPR=4/5

TPR may increase (or remain the same) because an actual one (1) that was wrongly classified as zero (0) under the high cut-off may, with the reduced threshold, be correctly classified as 1, increasing the number of true positives. FPR will increase because an actual zero that was correctly classified as zero may get reclassified as 1, adding another false positive. As both FPR and TPR increase, our point on the ROC plane moves right and up. Similarly, increasing the threshold decreases both TPR and FPR, and the point moves down and to the left (in the diagram above, point 'a' moves to point 'b'). An ROC curve is drawn by varying this threshold; its shape depends on how discriminating the model output is.
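The two cut-offs in the table can be reproduced by sweeping a threshold over the model outputs. A sketch using the ten records above:

```python
def tpr_fpr(actual, scores, cutoff):
    """TPR and FPR when a score >= cutoff is predicted as 1."""
    pred = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 1)
    return tp / actual.count(1), fp / actual.count(0)

actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.6, 0.5, 0.5, 0.45, 0.44, 0.5, 0.5, 0.45, 0.45, 0.4]

print(tpr_fpr(actual, scores, 0.5))    # -> (0.6, 0.4), i.e. TPR=3/5, FPR=2/5
print(tpr_fpr(actual, scores, 0.45))   # -> (0.8, 0.8), i.e. TPR=4/5, FPR=4/5
```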

The more an ROC curve is lifted up and away from the diagonal, the better the model. Since the X-axis scale has a maximum of 1 (FPR varies from 0 to 100%, i.e. 0 to 1) and the Y-axis scale likewise, the total area of the plot (a unit square) is 1*1 = 1. The more the curve is lifted away from the diagonal and hugs the top horizontal line TPR = 1, the more area lies under it and the closer that area is to 1. Thus the maximum possible area under an ROC curve is 1. If the curve coincides with the diagonal, the area under it is 0.5 (a diagonal halves a square). We have just learnt that an ROC curve along the diagonal corresponds to a random classification model. The area under the (ROC) curve is known as the AUC. This area should therefore be greater than 0.5 for a model to be acceptable; a model with an AUC of 0.5 or less is worthless. Understandably, this area is a measure of the predictive accuracy of the model.
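The AUC itself is just the area under the (FPR, TPR) points, commonly computed with the trapezoidal rule. A sketch showing that the diagonal gives 0.5 while a curve lifted above it gives more:

```python
def auc(points):
    """Trapezoidal area under (fpr, tpr) points sorted by fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

diagonal = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]   # random model
lifted   = [(0.0, 0.0), (0.4, 0.7), (1.0, 1.0)]   # the (0.4, 0.7) point from above

print(round(auc(diagonal), 4))   # -> 0.5
print(round(auc(lifted), 4))     # -> 0.65
```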

You may have a look at the two ROC curves in Wikipedia.

