Using R package, recommenderlab, for predicting ratings for MovieLens data

This is a problem of predicting the ratings that users would give to movies they have not yet rated. The problem is outlined at this page of ‘Kaggle in Class’, and the data are also available there. There are two files, ‘train_v2.csv’ and ‘test_v2.csv’. The file ‘train_v2.csv’ has 750156 lines of data; a sample is as follows.

ID,user,movie,rating
610739,3704,3784,3
324753,1924,802,3
808218,4837,1387,4
133808,867,1196,4
431858,2631,3072,5

Sample data from ‘test_v2.csv’ is as follows. User ratings for movies are not given; they are to be predicted. There are 250053 lines of data.

ID,user,movie
895537,5412,2683
899740,5440,904
55688,368,3717
63728,425,1721
822012,4942,3697
781895,4668,2011
472806,2907,173

This prediction can be made using matrix factorization, item-based collaborative filtering or user-based collaborative filtering. We will use the latter two (collaborative filtering) methods in this blog. An explanation of item-based collaborative filtering may be seen here. We will use the R package recommenderlab. It uses a data structure, ratingMatrix, to provide a common interface for rating data. This data structure implements many matrix-like methods, such as dim(), dimnames(), colCounts(), rowCounts(), colMeans(), rowMeans(), colSums() and rowSums(). Further, the method sample() can be used to sample data row-wise (that is, user-wise). The complete R code is below; it is heavily commented for easy understanding.
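Before the full script, here is a minimal sketch of the ratingMatrix interface described above, using a tiny made-up matrix rather than the MovieLens data. The matrix values and the names m and toy are invented for illustration only; NA marks a movie the user has not rated.

```r
library(recommenderlab)

# Toy 3-user x 4-movie matrix (illustrative values, not MovieLens data).
# NA marks a movie the user has not rated.
m <- matrix(c(3,  4, NA,  5,
              1, NA,  5, NA,
              4,  4,  2,  5),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("u1", "u2", "u3"),
                            c("m1", "m2", "m3", "m4")))

# Coerce to recommenderlab's sparse rating structure
toy <- as(m, "realRatingMatrix")

dim(toy)        # number of users x number of movies
rowCounts(toy)  # how many movies each user has rated
colCounts(toy)  # how many users have rated each movie
rowMeans(toy)   # mean rating per user, over rated movies only
```

The counts and means are computed over the ratings actually present, so unrated movies do not pull a user's average down.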

#### Kaggle in Class Problem. ....
####Reference: https://inclass.kaggle.com/c/predict-movie-ratings

# Set data path as per your data file (for example: "c://abc//" )
setwd("/home/ashokharnal/Documents/")

# If not installed, first install the following three packages in R, e.g.
#   install.packages(c("recommenderlab", "reshape2", "ggplot2"))
library(recommenderlab)
library(reshape2)
library(ggplot2)
# Read training file along with header
tr<-read.csv("train_v2.csv",header=TRUE)
# Just look at first few lines of this file
head(tr)
# Remove 'ID' column. We do not need it
tr<-tr[,-c(1)]
# Check that it was removed (shows the ratings given by user 1)
tr[tr$user==1,]
# Using acast to convert above data as follows:
#       m1  m2   m3   m4
# u1    3   4    2    5
# u2    1   6    5
# u3    4   4    2    5
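#   value.var names the column that fills the cells. Without it, acast
#   guesses the column and prints the (harmless) message
#   "Using rating as value column: use value.var to override."
g<-acast(tr, user ~ movie, value.var="rating")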
# Check the class of g
class(g)

# Convert it as a matrix
R<-as.matrix(g)

# Convert R into realRatingMatrix data structure
#   realRatingMatrix is a recommenderlab sparse-matrix like data-structure
r <- as(R, "realRatingMatrix")
r

# View r in other possible ways
as(r, "list")     # A list
as(r, "matrix")   # An ordinary (dense) matrix, with NA for missing ratings

# I can turn it into data-frame
head(as(r, "data.frame"))

# normalize the rating matrix
r_m <- normalize(r)
r_m
as(r_m, "list")

# Draw an image plot of raw ratings & normalized ratings
#  Each column represents one specific movie, each row one user,
#   and ratings are shown as shaded cells.
#   Note that some movies are rated 'black' by most users who rated them
#    while some movies are not rated by many users
#     On the other hand a few users almost always give high ratings;
#      these show up as a series of black dots cutting across movies
image(r, main = "Raw Ratings")       
image(r_m, main = "Normalized Ratings")

# Can also turn the matrix into a 0-1 binary matrix
r_b <- binarize(r, minRating=1)
as(r_b, "matrix")

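# Create a recommender object (model)
#   Run any one of the following four code lines.
#     Do not run all four
#       They pertain to different algorithms and similarity measures.
#        UBCF: User-based collaborative filtering
#        IBCF: Item-based collaborative filtering
#        POPULAR: Recommendations based on item popularity
#      Parameter 'method' inside param decides similarity measure
#        Cosine or Jaccard
#      Parameter 'nn' is the number of nearest neighbours (UBCF)
#      Note: newer versions of recommenderlab may reject 'minRating'
#        as an unknown parameter; if so, remove it from param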
rec=Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))
rec=Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Jaccard",nn=5, minRating=1))
rec=Recommender(r[1:nrow(r)],method="IBCF", param=list(normalize = "Z-score",method="Jaccard",minRating=1))
rec=Recommender(r[1:nrow(r)],method="POPULAR")

# Depending upon your selection, examine what you got
print(rec)
names(getModel(rec))
getModel(rec)$nn   # Number of nearest neighbours; present for UBCF models only

############Create predictions#############################
# This prediction does not predict movie ratings for test.
#   But it fills up the user 'X' item matrix so that
#    for any userid and movieid, I can find predicted rating
#     dim(r) shows there are 6040 users (rows)
#      'type' parameter decides whether you want ratings or top-n items
#         get top-10 recommendations for a user, as:
#             predict(rec, r[1:nrow(r)], type="topNList", n=10)
recom <- predict(rec, r[1:nrow(r)], type="ratings")
recom

########## Examination of model & experimentation  #############
########## This section can be skipped #########################

# Convert prediction into list, user-wise
as(recom, "list")
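# Study and compare the following:
as(r, "matrix")     # Has lots of NAs. 'r' is the original matrix
as(recom, "matrix") # Holds the predicted ratings (depending on the recommenderlab
                    #   version, entries the user had already rated may be NA)
as(recom, "matrix")[,1:10] # Show ratings for all users for items 1 to 10
as(recom, "matrix")[5,3]   # Rating for user 5 for item at index 3
as.integer(as(recom, "matrix")[5,3]) # Just get the integer part (truncated)
as.integer(round(as(recom, "matrix")[6039,8])) # Get the rating rounded to the nearest integer
# Index by name (character) to look up a particular user id and movie id.
#   A numeric index such as [368,3717] selects by position, not by id,
#   and fails if the matrix has fewer than 3717 columns
as.integer(round(as(recom, "matrix")["368","3717"]))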

# Convert all your recommendations to list structure
rec_list<-as(recom,"list")
head(summary(rec_list))
# Access this list. User 2, item at index 2
rec_list[[2]][2]
# Convert to data frame all recommendations for user 1
u1<-as.data.frame(rec_list[[1]])
attributes(u1)
class(u1)
# Create a column by name of id in data frame u1 and populate it with row names
u1$id<-row.names(u1)
# Check movie ratings are in column 1 of u1
u1
# Now access movie ratings in column 1 for u1
u1[u1$id==3952,1]

########## Create submission File from model #######################
# Read test file
test<-read.csv("test_v2.csv",header=TRUE)
head(test)
# Get ratings list
rec_list<-as(recom,"list")
head(summary(rec_list))
# Pre-allocate the vector of predicted ratings
ratings <- numeric(nrow(test))
# For all lines in test file, one by one
for (u in seq_len(nrow(test)))
{
   # Read userid and movieid from columns 2 and 3 of test data
   userid  <- test[u,2]
   movieid <- test[u,3]

   # Get all recommendations for user: userid, and convert them to a data frame
   # Note: rec_list is indexed by position here, which assumes that user ids
   #   run from 1 to nrow(r) without gaps (as they appear to in this data)
   u1<-as.data.frame(rec_list[[userid]])
   # Create a second column, id, in the data frame u1 and populate it with row names
   # Remember (or check) that row names of u1 are movie ids
   # We use row.names() function
   u1$id<-row.names(u1)
   # Now access movie ratings in column 1 of u1
   x <- u1[u1$id==movieid,1]
   # If no rating was found, assign 0. You could also
   #   assign the user's average rating
   if (length(x)==0)
   {
     ratings[u] <- 0
   }
   else
   {
     ratings[u] <- x
   }

}
length(ratings)
tx<-cbind(test[,1],round(ratings))
# Write to a csv file: submitfile.csv in your folder
write.table(tx,file="submitfile.csv",row.names=FALSE,col.names=FALSE,sep=',')
# Submit now this csv file to kaggle
########################################
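The row-by-row loop above can be slow for the 250053 rows of the test file. As a minimal sketch, assuming the predictions are available as an ordinary matrix whose row names are user ids and whose column names are movie ids (which is what as(recom, "matrix") is expected to give), the same lookup can be done in one step with match() and matrix indexing. The helper name lookup_ratings, the toy matrix and the fallback value of 0 are illustrative choices, not part of the original script.

```r
# Hypothetical helper: one predicted rating per (user, movie) pair.
lookup_ratings <- function(pred, users, movies, fallback = 0) {
  i <- match(as.character(users),  rownames(pred))  # NA if the user is unknown
  j <- match(as.character(movies), colnames(pred))  # NA if the movie is unknown
  out <- pred[cbind(i, j)]     # rows of the index matrix with NA give NA
  out[is.na(out)] <- fallback  # unknown pairs and NA predictions get the fallback
  out
}

# Toy prediction matrix standing in for as(recom, "matrix")
pred <- matrix(c(3.2,  NA, 4.6,
                 2.1, 3.9,  NA),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("368", "425"),
                               c("904", "1721", "3717")))

# Toy test rows in the same ID, user, movie layout as test_v2.csv;
# user 9999 is made up to show the fallback
toy_test <- data.frame(ID    = c(55688, 63728, 1),
                       user  = c(368, 425, 9999),
                       movie = c(3717, 1721, 904))

toy_test$rating <- round(lookup_ratings(pred, toy_test$user, toy_test$movie))
toy_test
```

Because the lookup goes by name rather than by position, it does not depend on user ids or movie ids being consecutive.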

Incidentally, the R-package, recommenderlab, will also build models with large datasets.
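With larger datasets, the acast() step in the script can become the bottleneck, because it builds a full user-by-movie matrix in memory. A minimal sketch of an alternative, assuming the installed recommenderlab version supports coercing a data frame of (user, item, rating) rows directly to a realRatingMatrix, is shown below. The data frame tr_toy simply repeats the five sample rows shown earlier, without the ID column.

```r
library(recommenderlab)

# Five sample ratings in user, movie, rating layout (ID column dropped)
tr_toy <- data.frame(user   = c(3704, 1924, 4837, 867, 2631),
                     movie  = c(3784, 802, 1387, 1196, 3072),
                     rating = c(3, 3, 4, 4, 5))

# Coerce the triplets straight to a sparse rating matrix;
# the first three columns are read as user, item and rating
r_toy <- as(tr_toy, "realRatingMatrix")

r_toy
dim(r_toy)        # distinct users x distinct movies
rowCounts(r_toy)  # one rating per user in this toy sample
```

The resulting object can be passed to Recommender() and predict() in the same way as r in the script.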


14 Responses to “Using R package, recommenderlab, for predicting ratings for MovieLens data”

  1. Deepu Says:

    What does nn parameter below do? Is it number of nearest neighbours to be considered?

    Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Jaccard",nn=5, minRating=1))

  2. Ahmad Says:

    Great post.

    When I tried to apply the tutorial on my own data set that has a user_id, business_id, and rating, I faced the following error when I tried to use the acast() function from the reshape2 package:

    > g dim(ratings)
    [1] 1569264 3

    Any idea on how I can transform my densed-structure data set to the structure required by the recommenderlab package (realRatingMatrix) without using acast?

    These are the first lines of my ratings data set:

    > head(ratings)
    user_id business_id rating
    1 Xqd0DzHaiyRqVH3WRG7hzg vcNAWiLM4dR7D2nwwJ7nCA 5
    2 H1kH6QZV7Le4zqTRNxoZow vcNAWiLM4dR7D2nwwJ7nCA 2
    3 zvJCcrpm2yOZrxKffwGQLA vcNAWiLM4dR7D2nwwJ7nCA 4
    4 KBLW4wJA_fwoWmMhiHRVOA vcNAWiLM4dR7D2nwwJ7nCA 4
    5 zvJCcrpm2yOZrxKffwGQLA vcNAWiLM4dR7D2nwwJ7nCA 4
    6 Qrs3EICADUKNFoUq2iHStA vcNAWiLM4dR7D2nwwJ7nCA 1

    Thanks
    Ahmad

  3. selmi Says:

    Hello Ahmad, can you print a sample of your data so that we can help you? Try this: str(your data) and head(your data)

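  4. Reference information on recommender systems | かものはしの分析ブログ Says:

    […] recommenderlab ・Recommendation algorithms ・Evaluating recommendations ・Preparing data for recommendation ・Sample datasets are also included (ratings of jokes by 5,000 users) Using R package, recommenderlab, for predicting ratings for MovieLens data […]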

  5. Naga Lakshmipathi Anantha Says:

    when i execute the following command, i am getting the following output. But in the material mentioned above
    as(recom, "matrix") # Is full of ratings. NAs disappear
    as(recom, "matrix")
    [1,] NA NA NA
    [2,] NA NA NA
    [3,] NA NA NA
    [4,] NA NA NA
    [5,] NA NA 557.5
    [6,] NA NA NA

    i am getting like this. Could you please help me?

  6. Kartik Says:

    Did u try creating a training and test data set with CV techniques and hold out values for checking the performance. How did you check the model performance before submitting the model

  7. Maya Says:

    Hello,

    I am trying to test this with your sample data but I get this error:

    Error in as(recom, "matrix")[, 1:10] : subscript out of bounds

    Can you help me?

  8. siddzz223 Says:

    You might not have 10 columns in your rating matrix. Like in case of movies you might have less than 10 movies in your dataset.

  9. ezgi Says:

    hey, im trying to this line:

    g<-acast(tr, user ~ movie)

    and i get that error:
    Using rating as value column: use value.var to override.

    what should i do?

  10. cemuney Says:

    g<-acast(tr, user ~ movie)
    Same error?
    "Using rating as value column: use value.var to override."

  11. gaurav (@gaurav67890_) Says:

    Hey! Thanks for the great script… I am unable to predict my data. When I write recom <- predict(rec, r[1:nrow(r)], type="ratings") it says the matrix is too large, exceeding 11 GB. What should I do? The total number of components in my matrix is just 4800000.

  12. valerio Says:

    I run it all and it works perfectly! Thank you very much! Thanks to the developer I got a fantastic mark in the project of data mining at my university. Couldn’t be more grateful

  13. Raul Says:

    I am a bit stumped here with one of the data manipulation steps. I see that you have normalized the data and stored it in “r_m” variable (line 44). However, as I move down in the code, I expected that the normalized data (r_m) would be used while building the recommender system (line 70), but, I see that you have continued to use “r” variable which basically stores the non-normalized data along with a parameter list which states normalize = “Z-Score”.

    Two questions:
    1. Why did you normalize the data (line 44) when it (r_m) is not being used further down in your code?

    2. Any specific reason that you are considering “Z-score” normalization (line 70) and not going with normalization by subtracting “mean” from each data point – which I guess is the default option (I may be wrong here)? And if this step takes care of normalization of your data “r”, then why have “r_m” in the first place?

    Would help if you could throw some light on it!

    Thanks
