Recommender System using RapidMiner

Recommender systems can be broadly classified into content-based systems and systems based on collaborative filtering (CF). Collaborative filtering can be further divided into memory-based CF and model-based CF, and memory-based CF into user-based CF and item-based CF. Within both user-based and item-based CF there are further variants depending on the similarity measure chosen, typically Pearson correlation or cosine similarity. In both cases, the aim is to predict a user’s ratings for items he has not yet rated and then suggest to him the item(s) with the highest predicted rating(s).
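To make the similarity step concrete, the sketch below computes cosine similarity between two items over their co-rated entries. The rating values are invented for the example; a real engine would draw them from the user–item matrix, and Pearson correlation would additionally mean-center them.

```shell
# Two items' ratings from five users; 0 means "not rated".
# Cosine similarity is computed over the co-rated pairs only.
echo "5 4
3 0
4 4
0 2
1 1" | awk '{ if ($1 > 0 && $2 > 0) { dot += $1*$2; a += $1*$1; b += $2*$2 } }
END { printf "%.4f\n", dot / (sqrt(a) * sqrt(b)) }'
# prints: 0.9938
```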

Data for Collaborative filtering

Plenty of public data sets for testing recommender engines are available; a list is available at this site. We will use the zipped data set (ml-1m) from the GroupLens Research Project (http://www.grouplens.org/). This file contains 1,000,209 movie ratings for 3952 movies made by 6040 users. Ratings are on a scale of 1 to 5 (whole-star ratings only). There are three files: ‘ratings.dat’, ‘users.dat’ and ‘movies.dat’. File ‘ratings.dat’ contains the user ratings in the format: UserID::MovieID::Rating::Timestamp. Each user has at least 20 ratings. Since there are 3952 movies and a typical user rates only a small fraction of them, the user–item rating matrix is very sparse.
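To put a number on that sparsity, the filled fraction of the user–item matrix follows directly from the counts above:

```shell
# 1,000,209 ratings spread over a 6040-user x 3952-movie matrix
awk 'BEGIN { printf "density = %.2f%%\n", 100 * 1000209 / (6040 * 3952) }'
# prints: density = 4.19%
```

In other words, about 96% of the matrix is empty.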

User information is in the file ‘users.dat’, in the format: UserID::Gender::Age::Occupation::Zipcode. Field information pertaining to the ‘users.dat’ file is given in the README file. The file ‘movies.dat’ contains movie information in the format: MovieID::Title::Genres. Titles are identical to those provided by IMDB (including year of release), and genres are pipe-separated and selected from the following: Action, Adventure, Animation, Children’s, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western.

For predicting user ratings with collaborative filtering we need data with three fields: userid, movieid and rating. This data is available in the file ‘ratings.dat’. For the collaborative filtering experiments we do not need the ‘Timestamp’ column.

RapidMiner extension for Recommender engine

The RapidMiner extension for a recommender engine has been created by e-LICO, an e-Laboratory for Interdisciplinary Collaborative Research in Data Mining and Data-Intensive Science. It can be downloaded from here after registering at the RapidMiner Marketplace. It is available under the AGPL license. Download three jar files: rmx_irbrecommender-ANY-5.1.1, rmx_community-ANY-5.3.0, and rmx_dmassist-ANY-5.1.2. Place these jar files in the folder rapidminer/lib/plugins/ and restart RapidMiner. The extensions work with version 5.3.015 of RapidMiner Studio. Verify that the following three types of operators are present in RapidMiner by searching for ‘recommender’.

Figure 1: Three broad types of recommender systems installed in RapidMiner

We will be using operators pertaining to Collaborative Filtering under  ‘Item Rating Prediction’.

We need to convert the ‘ratings.dat’ file into an appropriate data format. In Windows, open this file in Excel (in the Text Import Wizard, enter ‘:’ against ‘Other’ and check ‘Treat consecutive delimiters as one’). After importing, delete the ‘Timestamp’ column and save the file back in CSV (comma separated) format. Open this csv file in WordPad and replace all commas with a single space. Save the file as ‘rev_ratings.dat’ (again as a plain text document) so that the three numbers on each line are separated by a single space.

In Linux, just run the following single line command in shell:

awk 'BEGIN {FS="::"} ; {print $1 " " $2 " " $3}' ratings.dat > rev_ratings.dat
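To see what the command does, here it is applied to a two-line sample in the ratings.dat format (the values are illustrative):

```shell
# Two sample records in UserID::MovieID::Rating::Timestamp format
printf '1::1193::5::978300760\n1::661::3::978302109\n' > /tmp/ratings_sample.dat
# Split on "::" and keep only the first three fields
awk 'BEGIN {FS="::"} ; {print $1 " " $2 " " $3}' /tmp/ratings_sample.dat
# prints:
# 1 1193 5
# 1 661 3
```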

The file rev_ratings.dat contains all 1,000,209 records. We need to take out a sample of records from this file to test the model. We use the following script to create a sample file whose lines are picked randomly (without replacement) from ‘rev_ratings.dat’. The script creates two files: ‘training.dat’ and ‘tovalidate.dat’.

#!/bin/bash
#
# Generates test-sample file to test recommender engine.
# The script randomly picks up n-lines (n to be specified) from the
# given file. It creates two files: training.dat and tovalidate.dat
# The original file remains undisturbed.
#
# Get your home folder
cd ~
homefolder=`pwd`
# Your data folder & data file
datadir="$homefolder/Documents/data_analysis/rapidminer"
originalfile="rev_ratings.dat"
samplefile="tovalidate.dat"

cd $datadir
# Delete earlier sample file & recreate one of zero byte
rm -f $datadir/$samplefile
touch $datadir/$samplefile
# Delete training file and recreate one of zero byte
rm -f $datadir/training.dat
cp $datadir/$originalfile  $datadir/training.dat
# Delete temp file, if it exists
rm -f temp.txt
# Get number of lines in given file
nooflines=`sed -n '$=' $datadir/$originalfile`

echo "No of lines in $datadir/training.dat  are: $nooflines"
echo -n "Your sample size (recommended 10% of orig file)? " ; read samplesize

# Default sample size is 10
if [ -z "$samplesize" ] ; then
    echo "Default value of sample size = 10"
    samplesize=10
fi

# Bash loop to pick random line numbers
echo "Will pick $samplesize random lines out of $nooflines"
echo "Wait....."
for (( i = 1 ; i <= $samplesize ; ++i ));
    do
        # training.dat shrinks by one line each pass; keep the pick in range.
        # RANDOM alone tops out at 32767, so combine two calls for large files.
        remaining=$(( nooflines - i + 1 ))
        arr[i]=$(( ( (RANDOM * 32768 + RANDOM) % remaining ) + 1 ));
        lineno="${arr[i]}"
        # Extract and append line to sample file
        sed -n "${lineno}p" $datadir/training.dat >> $datadir/$samplefile
        # Delete the same line from training.dat
        sed "${lineno}d" $datadir/training.dat > temp.txt
        mv temp.txt $datadir/training.dat
    done
trlines=`sed -n '$=' $datadir/training.dat`
samplines=`sed -n '$=' $datadir/$samplefile`

# Delete temp file
rm -f temp.txt

echo "---------------------------------"    
echo "Lines in sample file $samplefile: $samplines"
echo "Lines in training file training.dat : $trlines" ;
echo "Data folder: $datadir"
echo "---------------------------------"
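The sed-based deletion above is slow on a million-line file. Where GNU coreutils is available, the same split can be sketched with shuf; this assumes every line of the file is unique, which holds here because each user rates a given movie at most once:

```shell
# Sample 10% of lines without replacement into tovalidate.dat;
# training.dat gets every remaining line (order is not preserved).
total=$(wc -l < rev_ratings.dat)
shuf -n "$(( total / 10 ))" rev_ratings.dat > tovalidate.dat
# Lines present in both inputs appear twice after sort, so uniq -u drops them
sort rev_ratings.dat tovalidate.dat | uniq -u > training.dat
```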

Creating and reading Attribute Description file (AML files)

We will now create two attribute description files to describe the attributes of our two ‘.dat’ files. An attribute description file has the extension .aml and is in a standard XML format, shown below. The outermost tag must be the <attributeset> tag; immediately inside it come <attribute> tags, of which there can be any number. A <label> tag can be used instead of an <attribute> tag if the role of that attribute is that of a class label. For more details on the .aml file format see this link. Our two attribute description files, movie_train.aml and movie_test.aml, are as below:

<?xml version="1.0" encoding="windows-1252"?>
<attributeset default_source="training.dat">
    <attribute
        name = "user_id"
        sourcecol = "1"
        valuetype = "integer"/>
    <attribute
        name = "item_id"
        sourcecol = "2"
        valuetype = "integer"/>
    <attribute
        name = "rating"
        sourcecol = "3"
        valuetype = "real"/>
</attributeset>

The second AML file that refers to data in tovalidate.dat file is:

<?xml version="1.0" encoding="windows-1252"?>
<attributeset default_source="tovalidate.dat">
    <attribute
        name = "user_id"
        sourcecol = "1"
        valuetype = "integer"/>
    <attribute
        name = "item_id"
        sourcecol = "2"
        valuetype = "integer"/>
    <attribute
        name = "rating"
        sourcecol = "3"
        valuetype = "real"/>
</attributeset>

And a sample of data in training.dat file is as below:

123 2011 4
42 377 4
163 3268 1
136 1225 4
187 2610 5
175 2590 5
189 1717 4
173 2323 5

Construct Workflow in RapidMiner
In RapidMiner, drag the Read AML operator twice into the workspace. Then click on a Read AML operator and, in the Parameters panel, click the grid icon against the attributes parameter (see red arrow in the figure).

Figure 2: Reading attribute description files. Click on the grid icon next to the attributes parameter.

This opens the Attribute Editor window. Again click on the icon with the + (plus) sign (see red arrow in the figure below) and browse to the ‘movie_train.aml’ file in your file system. The data from the file ‘training.dat’ is pulled up and shown in the Attribute window. Close the Attribute Editor. Similarly, use the second Read AML operator to open ‘movie_test.aml’ and the associated tovalidate.dat file. You can connect the output (out) ports of both Read AML operators to the results (res) ports at the right wall of the workspace. Click the Run icon (Process–>Run, or press F11) to observe the file contents. The Read AML operator reads the data file at the time the process runs; nothing is imported beforehand into the RapidMiner repository. Thus, the data in the dat files can be changed before each re-run.

Figure 3: Click on the grid icon with the + sign (red arrow) and browse to the .aml file to open it.

We now have to set roles for the three fields. Roles will be as follows:

rating  : label
user_id : user identification
item_id : item identification

Note that the roles for user_id and item_id are a little unusual, but the extension requires these roles. Drag the Set Role operator twice from the Operators panel and place the operators in the workspace as shown in the figure. Click on a Set Role operator and then click on Edit List (see figure).

Figure 4: Drag ‘Set Role’ from the Operators window into the workspace. Click on ‘Edit List’ to set additional roles.

The Edit Parameter List window is shown in the following figure. While typing out the roles ‘user identification’ and ‘item identification’, each key may have to be pressed twice, otherwise the text does not get entered properly. Click Apply to return to the workspace. Do the same for the other Set Role operator.

Figure 5: Assigning roles to user_id (as ‘user identification’), item_id (as ‘item identification’) and rating (as ‘label’).

Drag the operator Item k-NN, from under Recommender Systems–>Item Rating Prediction, into the workspace. Also drag Apply Model (Rating Prediction) and Performance (Rating Prediction) from the Operators window. Make the workflow connections as shown below.

Figure 6: The complete model with Item k-NN as model builder. We have here the default k=80.

Run the model. Performance in terms of RMSE, MAE and Normalized Mean Absolute Error (NMAE) is shown.


Figure 7: Performance of the model.
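For reference, the RMSE and MAE figures reported by the Performance operator can be computed by hand from (actual, predicted) rating pairs; the four pairs below are invented to show the arithmetic, not taken from the model output.

```shell
# Each input line: actual rating, predicted rating
printf '5 4.5\n3 3.5\n4 4.0\n2 3.0\n' |
awk '{ d = $1 - $2; se += d*d; ae += (d < 0 ? -d : d); n++ }
     END { printf "RMSE=%.3f MAE=%.3f\n", sqrt(se/n), ae/n }'
# prints: RMSE=0.612 MAE=0.500
```

NMAE is simply MAE divided by the rating range (here 5 − 1 = 4).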

Finding user ratings for un-rated movies

If you are satisfied with the model, you can proceed to run it with real data, i.e. data for the user whose ratings for some movies we want to predict. We will edit our ratings.dat and tovalidate.dat files to do this. Our tovalidate.dat file now contains ratings for user 1 only (records for other users are deleted); it no longer contains the randomly selected records. And for this user, for the movies with IDs 150, 1, 1961, 1962, 260, 1029, 1207, 2028, 531, 3114, 608 and 1246 we delete (only) the ratings, leaving those fields blank. Our tovalidate.dat file therefore contains just the following data; the last twelve records have no ratings.

1 1193 5
1 661 3
1 3408 4
1 2355 5
1 1197 3
1 1287 5
1 2804 5
1 919 4
1 595 5
1 938 4
1 2398 4
1 2918 4
1 2791 4
1 2687 3
1 2018 4
1 3105 5
1 2797 4
1 720 3
1 48 5
1 1097 4
1 1721 4
1 1545 4
1 745 3
1 2294 4
1 3186 4
1 588 4
1 1907 4
1 783 4
1 1836 5
1 1022 5
1 2762 4
1 150
1 1
1 1961
1 1962
1 260
1 1029
1 1207
1 2028
1 531
1 3114
1 608
1 1246

We have also edited the file ratings.dat: the twelve records pertaining to user ‘1’ for the movies with IDs 150, 1, 1961, 1962, 260, 1029, 1207, 2028, 531, 3114, 608 and 1246 have been deleted, so that for user 1 it contains only those records for which his ratings are available. Data about all other users remains in this file as it was. We then create the following workflow in RapidMiner and run it.
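These two edits can also be scripted rather than done by hand in an editor. A sketch, assuming the space-separated rev_ratings.dat produced earlier and the hypothetical output names ratings_train.dat and tovalidate.dat:

```shell
ids="150 1 1961 1962 260 1029 1207 2028 531 3114 608 1246"
# Training data: drop user 1's records for the twelve chosen movies
awk -v ids="$ids" 'BEGIN { split(ids, a); for (i in a) m[a[i]] = 1 }
     !($1 == 1 && ($2 in m))' rev_ratings.dat > ratings_train.dat
# Validation data: user 1 only, with ratings blanked for the twelve movies
awk -v ids="$ids" 'BEGIN { split(ids, a); for (i in a) m[a[i]] = 1 }
     $1 == 1 { if ($2 in m) print $1, $2; else print }' rev_ratings.dat > tovalidate.dat
```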

Figure 8: Workflow to find user ratings for movies for which no ratings have been given by the user.

The results of this workflow can be seen in Figure 9: predictions have been made for user 1 for the movies for which he had given no ratings.


Figure 9: Ratings have now been predicted for movies that user 1 had not rated (Rows 75 to 86).

Parameter optimization

The k-NN algorithm has a number of parameters, prominent among them k. Is there a way to automatically test various values of k and decide the best one? Yes, there is. Rebuild the RapidMiner workflow as shown in the next three figures using the Optimize Parameters (Grid) operator. There are three Optimize Parameters operators available in RapidMiner; we will use only the Grid operator. You are free to try the other two, especially Optimize Parameters (Evolutionary).

Figure 10: Use the Optimize Parameters (Grid) operator to find the k in Item k-NN that improves performance most.

Create the workflow as shown in Figure 10 (above). Next, click the button ‘Edit Parameter Settings…’ (see red arrow). In the window that opens (see below), three operators appear in the Operators panel. Select the first operator, Item k-NN. On selection, all parameters of Item k-NN get displayed in the middle panel, Parameters. Select k and click the forward-arrow button to move k to the ‘Selected Parameters’ panel. Under the Grid/Range text boxes set the range for k (Min=1, Max=100, Steps=10) and then click the OK button.

Figure 11: Parameter k of Item k-NN is allowed to vary from 1 to 100.

Back in the workspace, double-click the lower-right overlapping-rectangles icon on the Optimize Parameters operator (see the red circle in Figure 10). This opens the Optimization Process shown in Figure 12 (below). Construct the workflow for the process to be optimized.

Figure 12: Within the Optimize Parameters operator, construct the workflow as shown here. Run the process (F11).

Run the complete process (press F11). The optimization process runs and after some time gives the following  results:

Figure 13: Result after process optimization. Best performance is achieved with k=21.
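Conceptually, the grid search amounts to the loop below. Here eval_rmse is a stand-in for a full RapidMiner training-and-evaluation run; it simply fakes an RMSE curve with its minimum at k=21 so the loop has something to find.

```shell
# Conceptual grid search over k (stand-in evaluation, not a real model run)
eval_rmse() { awk -v k="$1" 'BEGIN { printf "%.4f", (k-21)*(k-21)/1000 + 0.85 }'; }
best_k=0; best=999
for k in 1 11 21 31 41 51 61 71 81 91; do
    r=$(eval_rmse "$k")
    # Keep the k with the lowest RMSE so far (awk does the float comparison)
    if awk -v a="$r" -v b="$best" 'BEGIN { exit !(a < b) }'; then
        best=$r; best_k=$k
    fi
done
echo "best k=$best_k rmse=$best"
# prints: best k=21 rmse=0.8500
```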

Ensemble of models

A number of item-rating algorithms can be connected in parallel and their outputs combined in a weighted fashion to create a recommender model. For this purpose, we use two more operators: a process operator, Multiply, and a model-combiner operator, Model Combiner (Rating Prediction). The workflow is shown below.

Figure 14: Using the process operator, Multiply, and the operator, Model Combiner (Rating Prediction), to apply multiple models to a set of data.
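The weighted combination performed by Model Combiner (Rating Prediction) is itself simple arithmetic. A sketch with two models' predicted ratings and hypothetical weights 0.6 and 0.4:

```shell
# Each input line: prediction from model A, prediction from model B
printf '4.2 4.8\n3.1 2.9\n' |
awk '{ printf "%.2f\n", 0.6*$1 + 0.4*$2 }'
# prints:
# 4.44
# 3.02
```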

On running this multiple-model process with default parameters, there is a slight improvement in results, as shown below.


Figure 15: The performance of multiple models. The result is better than a single model.

Optimizing Parameters within ensemble

As before, we can optimize parameters within the ensemble using the Optimize Parameters (Grid) operator. The workflow within the optimization process window is shown below.


Figure 16: Optimizing parameters in an ensemble.

We will try to optimize the learning rate in the Matrix Factorization algorithm. The results below show a very marginal improvement over the earlier results.

Figure 17: Optimized performance of ensemble. Optimization was done for learning rate in MF.

Simultaneous performance measurement & finding user ratings for unrated items

In RapidMiner, you can measure model performance and obtain ratings for unrated items simultaneously with the following workflow. We have used the Multiply operator to get more than one output from the Item k-NN model: one output is used to test and measure performance, the other to get ratings for items unrated by some user(s).

Figure 18: Measuring performance and also predicting ratings for unrated user items.

 

This completes our discussion on using RapidMiner as a recommender engine. I hope the article proves useful.

 


9 Responses to “Recommender System using RapidMiner”

  1. Joseph Robertson Says:

    Nice post. Thanks.

    I have a question on the typing of attributes in the section “Creating and reading Attribute Description file (AML files)”. You have marked both “user_id” and “item_id” as integers. Shouldn’t they be “nominal”?

    • ashokharnal Says:

      I think it should not make any difference so long as we are able to assign the ‘user identification’ and ‘item identification’ roles to user_id and item_id. You can try and let us know if the results differ. Thanks.

      • Joseph Robertson Says:

        I was reviewing the source code and realized that ids correspond to positions in matrix. I got confused and thought item_id and user_id were used as predictors. Since these are matrix positions, integer is good. Thanks for your response. Looking forward to more posts from you. Appreciate you taking all the time to put this together.

  2. Alina GHERMAN Says:

    Firstly, thank you for this post!

    I would have a question — there are a couple of things I do not understand:
    1. how is the k-NN algorithm going to compute the similarities (cosine/Pearson?…)
    2. how does RapidMiner know to associate the dat file information (user_id and item_id) with the other features of the items and of the users?

    Thank you in advance!

    • ashokharnal Says:

      As to question 1, please see Figure 6. In the panel on the right where Item k-NN’s parameters are listed, against ‘Correlation mode’, you can select either Cosine or Pearson as the criterion for computing similarities. As to question 2, RapidMiner, or rather this mining algorithm, does not consider other attributes of items (like a movie’s director, actors etc.); it uses only users’ preferences. For making recommendations based on item attributes, you will have to look for other data mining algorithms.

  3. Alina GHERMAN Says:

    Hi, thank you for your prompt answer! Could you help me by indicating which other data mining algorithms I should look at in order to take other attributes into consideration?

  4. Sumana Ghosh Says:

    Where is this data_analysis folder stored? Please reply, it’s urgent.

  5. Anshu Sang Says:

    In Figure 18 (measuring performance and also predicting ratings for unrated user items), the first two Read AML operators are for training and testing data; which type of data does the third Read AML in the figure read? Please reply soon, I am waiting for your response.
