Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part IV

This is the last part of the series on logistic regression, in which we use different statistical tools for modelling. The data is described in Part-I, where we also performed some data preparation steps such as separating the training and hold-out samples and converting categorical data to integers (levels). In Part-II we built the model using RapidMiner, and in Part-III we used R to create a logistic regression model. In this part we use Mahout from the command line to build our model.

Given the training data and the validation sample, carrying out the analysis from the Mahout command line is quite straightforward. For the best possible results, a few model parameters need to be adjusted by trial and error. The Mahout command to train the machine and build the model is ‘trainlogistic’. The arguments to this command and their explanations are given below.

Command: mahout trainlogistic
  Flag       Any argument to flag      Description
--help                                  print this list
--quiet                                 be extra quiet
--input       input                     where to get training data
--output      output                    where to write the model
--target      target                    the name of the target variable
--categories  number                    the number of target categories to be considered
--predictors p [p ...]                  a list of predictor variables
--types      t [t ...]                  a list of predictor variable types (numeric, word or text)
--passes     passes                     the number of times to pass over the input data
--lambda     lambda                     the amount of coefficient decay to use
--rate	     learningRate               the learning rate
--noBias                                do not include a bias term
--features   numFeatures                the number of internal hashed features to use

In model fitting, we try to reduce the error between the observed and the predicted values. Mahout uses stochastic gradient descent (SGD) to reduce this error. In SGD we keep adjusting the model weights so long as the error keeps falling; that is, we walk down the gradient of the error curve (imagine it as U-shaped) until a minimum is reached, beyond which any further adjustment of the weights would increase the error again. The size of each downhill step depends upon the learning rate, which determines how fast we learn from new observations relative to old ones: a large learning rate places greater emphasis on newer values. To economize on computational cost, stochastic gradient descent estimates the error gradient at each step from a single record (or a small sample) rather than from the whole data set, which makes it very effective for large-scale machine learning problems. Note that the error curve may have local minima as well as a global minimum, so an appropriate learning rate is important. As we descend, the slope of the curve flattens, and to actually reach the minimum and stay there we must gradually lower the learning rate and also shrink the weights; the lambda parameter controls this coefficient decay (a regularization that pulls the weights towards zero). The number of passes indicates how many times the input data is passed over while building the model.
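As a rough sketch (the generic L2-regularized logistic SGD update, not necessarily Mahout's exact internals), with w the weight vector, x_i and y_i the predictors and 0/1 label of record i, \eta the learning rate and \lambda the coefficient decay, each record updates the weights as:

p_i = \sigma(w^\top x_i) = \frac{1}{1 + e^{-w^\top x_i}}
w \leftarrow w + \eta \, \bigl[ (y_i - p_i)\, x_i - \lambda\, w \bigr]

The (y_i - p_i) x_i term pushes the weights towards lower prediction error, while the \lambda w term keeps them from growing without bound.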

Once the model is ready, it can be validated and predictions made on the hold-out sample. The Mahout command for this purpose is ‘runlogistic’. You can get help on it by passing the ‘--help’ argument as follows:

[ashokharnal@master mahout]$ mahout runlogistic --help
 Usage:
 [--help --quiet --auc --scores --confusion --input <input> --model <model>]
 --help|--quiet|--auc|--scores|--confusion|--input|--model
   --help                         print this list                                                         
   --quiet                        be extra quiet                                                          
   --auc                          print AUC                                                               
   --scores                       print scores                                                            
   --confusion                    print confusion matrix                                                  
   --input input                  where to get training data                                              
   --model model                  where to get a model        

Given now the ‘training.csv’ file and the ‘tovalidate.csv’ file, the complete shell script is written below. The script is commented to make it clearer.

#!/bin/bash

export HADOOP_CLASSPATH="/usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*:$HADOOP_CLASSPATH"
datafolder="/home/ashokharnal/mahout"
modelfolder="model"
# Files, training and hold-out
datafile="training.csv"
validatefile="tovalidate.csv"
# Our model will be created in a file by this name
modelfile="model"

cd $datafolder
# Delete any existing model directory
rm -r -f $datafolder/$modelfolder
# And create again one
mkdir $datafolder/$modelfolder

# Copy tovalidate.csv from datafolder to model folder
cp $datafolder/$validatefile  $datafolder/$modelfolder/$validatefile

########Begin Training#################
echo " " ; echo " "
echo "Training phase. Buiding model: logit"
echo "-------------------------------"
echo " "; echo " "

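# Train the logistic model: 49 hashed features, 1000 passes over the data;
# lambda (coefficient decay) and the learning rate were chosen by trial and error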
mahout trainlogistic \
  --input $datafolder/$datafile \
  --output $datafolder/$modelfolder/$modelfile \
  --target y \
  --categories 2 \
  --predictors age    job   marital education default     balance  housing  loan  contact day     month  duration  campaign pdays previous  poutcome \
  --types     numeric word  word    word      word        numeric  word     word  word    numeric  word  numeric   numeric  numeric numeric word \
  --features 49  --passes 1000 --lambda 0.0001 \
  --rate 0.5

###############Model validation#################
echo " " ; echo " "
echo "---------------------------------------------------"  
echo -n "Validation phase. Begin validation? (Press ENTER key) " ; read x
echo "---------------------------------------------------"  

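# Score the hold-out sample with the trained model; print the scores, AUC and
# confusion matrix, and save everything to output.txt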
mahout runlogistic \
  --input $datafolder/$modelfolder/$validatefile \
  --model $datafolder/$modelfolder/$modelfile \
   --scores --auc --confusion  > $datafolder/output.txt

echo -n "Score and other output is in file output.txt"   ; read x

The ‘categories’ are obviously 2 (target y: yes/no). Why have we chosen ‘features’ as 49? It is because the total number of (categorical + numerical) levels per record is 49; attribute ‘job’, for example, has 12 levels. After selecting this value, one can play around with it (say, increase it to 100 or reduce it to 30, as sketched below) to see which value works best. For an easy explanation of ‘features’, ‘feature hashing’ and what is known as the ‘hashing trick’, please refer to a) Wikipedia, b) StackExchange Q/A and c) this article, in the order listed. These are the simplest explanations I could find.
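As a quick cross-check (a rough sketch: it assumes training.csv has a header row, comma-separated fields, and that ‘job’ is the second column), the number of distinct levels of a categorical column can be counted from the shell, and one can loop over a few candidate values of --features, reusing the variables defined in the script above, to compare models:

# count the distinct levels of the 'job' column (assumed to be column 2)
tail -n +2 training.csv | cut -d',' -f2 | sort -u | wc -l

# try a few candidate values for --features and keep each model separately
for nf in 30 49 100 ; do
  mahout trainlogistic \
    --input $datafolder/$datafile \
    --output $datafolder/$modelfolder/model_$nf \
    --target y --categories 2 \
    --predictors age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome \
    --types numeric word word word word numeric word word word numeric word numeric numeric numeric numeric word \
    --features $nf --passes 1000 --lambda 0.0001 --rate 0.5
done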

In our case, the minimum number of features across the categorical variables and the numeric variables (counting 1 per numeric variable) is 49. While training, we have distinguished the numeric variables from the categorical ones through the type keywords ‘numeric’ and ‘word’. The model is as follows.

y ~ -0.796*Intercept Term + -17.705*age + 0.006*balance + -0.732*campaign 
+ -0.396*contact=1 + -0.836*contact=2 + -0.305*contact=3 + -2.182*day 
+ -0.906*default=0 + -0.002*default=1 + 0.738*duration + -0.500*education=1 
+ -2.623*education=2 + -1.132*education=3 + -2.977*education=4 + -2.258*housing=0 
+ -0.784*housing=1 + 0.013*job=1 + -0.332*job=10 + -0.486*job=11 + -17.704*job=12 
+ -0.795*job=2 + -0.820*job=3 + -0.739*job=4 + -0.090*job=5 + -0.805*job=6 + -0.396*job=7 
+ -0.629*job=8 + -0.466*job=9 + -1.121*loan=0 + -0.149*loan=1 + -0.543*marital=1 
+ -1.024*marital=2 + -0.587*marital=3 + -0.120*month=1 + -0.279*month=10 + -0.442*month=11 
+ -0.276*month=12 + -0.373*month=2 + -0.326*month=3 + -0.502*month=4 + -1.191*month=5 
+ -0.001*month=6 + -0.001*month=7 + -0.394*month=8 + -0.342*month=9 + 0.394*pdays 
+ -0.395*poutcome=1 + -1.190*poutcome=2 + 0.738*poutcome=3 + -18.216*poutcome=4 + 0.000*previous
      Intercept Term -0.79622
                 age -17.70507
             balance 0.00645
            campaign -0.73175
           contact=1 -0.39627
           contact=2 -0.83619
           contact=3 -0.30474
                 day -2.18161
           default=0 -0.90618
           default=1 -0.00165
            duration 0.73830
         education=1 -0.49953
         education=2 -2.62331
         education=3 -1.13234
         education=4 -2.97668
           housing=0 -2.25766
           housing=1 -0.78387
               job=1 0.01325
              job=10 -0.33196
              job=11 -0.48632
              job=12 -17.70413
               job=2 -0.79466
               job=3 -0.82046
               job=4 -0.73949
               job=5 -0.09014
               job=6 -0.80492
               job=7 -0.39627
               job=8 -0.62859
               job=9 -0.46582
              loan=0 -1.12142
              loan=1 -0.14857
           marital=1 -0.54272
           marital=2 -1.02386
           marital=3 -0.58652
             month=1 -0.12023
            month=10 -0.27851
            month=11 -0.44170
            month=12 -0.27578
             month=2 -0.37268
             month=3 -0.32635
             month=4 -0.50196
             month=5 -1.19071
             month=6 -0.00078
             month=7 -0.00079
             month=8 -0.39449
             month=9 -0.34164
               pdays 0.39449
          poutcome=1 -0.39548
          poutcome=2 -1.18956
          poutcome=3 0.73830
          poutcome=4 -18.21576
            previous 0.00000

The model becomes quite complex and is practically littered with dummy variables; the number of independent dummy variables is quite large. We could, for example, have treated the ‘month’ variable as numeric to simplify things, as sketched below.
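Re-running the training with ‘month’ declared as numeric would only require changing the --types list; this is just a sketch of the alternative flags, everything else in the script staying the same:

  --predictors age     job   marital education default  balance  housing  loan  contact day     month    duration  campaign pdays   previous  poutcome \
  --types      numeric word  word    word      word     numeric  word     word  word    numeric numeric  numeric   numeric  numeric numeric   word \

The validation output from ‘runlogistic’ is as below: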

"target","model-output","log-likelihood"
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
0,0.000,-0.000000
1,0.000,-100.000000
....
..........
AUC = 0.84
confusion: [[4144.0, 199.0], [71.0, 86.0]]
entropy: [[NaN, NaN], [-45.2, -9.4]]

The area under the ROC (Receiver Operating Characteristic) curve, known as AUC, measures how well the model discriminates between the true positive rate (the proportion of 1s correctly classified) and the false positive rate (the proportion of 0s wrongly classified). For a simple explanation of AUC please refer to this entry in my blog: Simple explanation of AUC. While an AUC of 0.84 may be good enough (maximum 1; the higher the better), the results in the confusion matrix are not that encouraging. As per the confusion matrix, out of a total of 4343 predictions of ‘no’, 4144 or 95.4% are correct, while out of 157 predictions of ‘yes’, only 86 or about 55% are correct. You may experiment with changing the learning rate and lambda, but in this particular case this is the best we could get.
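Those two percentages can be checked directly from the confusion counts above (a throwaway one-liner; the four numbers are just the cells of the matrix printed by ‘runlogistic’):

# precision of the 'no' predictions: 4144/(4144+199); of the 'yes' predictions: 86/(71+86)
echo "4144 199 71 86" | awk '{printf "no: %.1f%%   yes: %.1f%%\n", 100*$1/($1+$2), 100*$4/($3+$4)}'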


3 Responses to “Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part IV”

  1. Adeyemi Odeneye Says:

    Good job! That was well written. Could you do something on fraud detection or market basket analysis using RapidMiner?

  2. Lukasz Says:

    Great articles, but I still don’t understand how to use the model in a real-life example.

    Two questions:
    1) What is the meaning of target/model-output/log-likelihood?
    2) The bank would like to use the model to identify a target group for the remaining part of the marketing campaign based on the previous results. In other words, the bank already knows which customers subscribed for the deposit and which didn’t, and it knows their characteristics. Now it has the next 100,000 potential customers and wants to select the 1,000 who are most likely to subscribe for the product. How to do this?
