Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part III

This is Part-III of the series. In Part-I we described the data under analysis, and in Part-II we used RapidMiner to draw up a logistic regression model. In this part, we use R for the same purpose.

R is a command/program oriented framework. It therefore has a steeper learning curve than menu-based statistical tools. However, once the initial hurdle is crossed, one is rewarded with a very rich and varied set of statistical processing tools, backed up by plenty of documentation.

There is no need to transform the bank data set as we did in Part-I (i.e. to convert categorical values into integers). We just have to tell R which variables are categorical, and R will handle them by itself while summarising the data or building the model. We will proceed in three steps. In Step 1, we import the data and look at its descriptive statistics. In Step 2, we build the model, and in Step 3, we evaluate its performance and use the model to make predictions.

# Read semicolon separated data.
# The data is read as a data-frame and assigned to bankdata
bankdata=read.csv("/home/ashokharnal/r/bank-full.csv",sep=";")
# We will partition the data-set in 70:30 ratio
# for training purposes and for validation purposes.
# Import library 'caret'
library(caret)
set.seed(3456)
# Ensure the class variable is a factor so that the split is stratified by class
bankdata$y=factor(bankdata$y)
# Split now
trainindex<-createDataPartition(bankdata$y,p=0.7,list=FALSE)
# Get training data-frame
banktrain<-bankdata[trainindex,]
# Get data-frame for testing
banktest<-bankdata[-trainindex,]
# Check number of rows in split-data set
nrow(banktrain)
nrow(banktest)
# Alternatively, check dimensions of both datasets
dim(banktrain)
dim(banktest)
# Check that both data-sets retain the column headings
names(banktrain)
names(banktest)
# Indicate now which variables are categorical variables
banktrain$y=factor(banktrain$y)
banktrain$job=factor(banktrain$job)
banktrain$marital=factor(banktrain$marital)
banktrain$education=factor(banktrain$education)
banktrain$housing=factor(banktrain$housing)
banktrain$loan=factor(banktrain$loan)
banktrain$contact=factor(banktrain$contact)
banktrain$month=factor(banktrain$month)
banktrain$poutcome=factor(banktrain$poutcome)
# Get descriptive stats for data
summary(banktrain)

We have used the caret package (Classification And REgression Training) for data splitting. Package ‘caret’ contains a set of functions to streamline the process of creating predictive models. The function createDataPartition() creates balanced splits of the data: if the class argument to this function is a factor (as in this case), the random sampling occurs within each class, and the split preserves the overall class distribution of the data. We set a random seed (set.seed) so that the split is reproducible. We get a training data-set, banktrain, and a test data-set, banktest. We have also indicated above which of the fields are categorical (factor) in nature.
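Since the split is stratified by class, a quick sanity check (a minimal sketch in base R, not part of the original listing) is to compare the class proportions in the two splits:

# Class proportions in the training and test sets should be nearly identical
prop.table(table(banktrain$y))
prop.table(table(banktest$y))

The summary output (descriptive statistics) is as follows: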

> summary(banktrain)

      age                 job           marital          education
 Min.   :18.00   blue-collar:6786   divorced: 3622   primary  : 4722
 1st Qu.:33.00   management :6613   married :19058   secondary:16298
 Median :39.00   technician :5369   single  : 8969   tertiary : 9329
 Mean   :40.92   admin.     :3602                    unknown  : 1300
 3rd Qu.:48.00   services   :2929
 Max.   :95.00   retired    :1574
                 (Other)    :4776

 default        balance       housing      loan            contact
 no :31076   Min.   : -8019   no :14164   no :26564   cellular :20468
 yes:  573   1st Qu.:    75   yes:17485   yes: 5085   telephone: 2056
             Median :   454                           unknown  : 9125
             Mean   :  1360
             3rd Qu.:  1432
             Max.   :102127

      day            month         duration         campaign
 Min.   : 1.00   may    :9530   Min.   :   0.0   Min.   : 1.000
 1st Qu.: 8.00   jul    :4864   1st Qu.: 103.0   1st Qu.: 1.000
 Median :16.00   aug    :4376   Median : 181.0   Median : 2.000
 Mean   :15.78   jun    :3776   Mean   : 259.9   Mean   : 2.747
 3rd Qu.:21.00   nov    :2776   3rd Qu.: 321.0   3rd Qu.: 3.000
 Max.   :31.00   apr    :2070   Max.   :4918.0   Max.   :55.000
                 (Other):4257

     pdays           previous         poutcome       y
 Min.   : -1.00   Min.   : 0.000   failure: 3410   no :27946
 1st Qu.: -1.00   1st Qu.: 0.000   other  : 1275   yes: 3703
 Median : -1.00   Median : 0.000   success: 1057
 Mean   : 39.82   Mean   : 0.568   unknown:25907
 3rd Qu.: -1.00   3rd Qu.: 0.000
 Max.   :854.00   Max.   :58.000

Note from the above that age varies from 18 to 95, but the third quartile at 48 years implies that younger persons outnumber older persons. Also, the (Other) category of job groups the remaining categories: ‘entrepreneur’, ‘housemaid’, ‘self-employed’, ‘student’, ‘unemployed’ and ‘unknown’. Any of these may be important, as we will shortly see. As to credit in default, only about 2% are defaulters. Since 98% are non-defaulters, this variable may be of little use in making predictions. Also note the statistics for the class variable, ‘y’. Around 28000 customers have not subscribed to a term deposit and only 3703 have subscribed. The model will have to work really hard to find a pattern. This said, let us now proceed with building the model. We will use Generalised Linear Modelling. In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function. The link function in our case is the logit function.

# Use glm function, with y as dependent variable and all others as independent variables (IVs), written
#  in the usual manner with '+' sign.
logitModel=glm(y~poutcome+previous+pdays+campaign+duration+month+day+contact+loan+housing+balance+default+education+marital+job+age, data=banktrain,family=binomial(logit))
# As all variables besides y are IVs, we could have used a shorter form (with a dot), as:
logitModel=glm(y~., data=banktrain,family=binomial(logit))
# Get model summary
summary(logitModel)

The model summary is shown next.

> summary(logitModel)

Call:
glm(formula = y ~ ., family = binomial(logit), data = banktrain)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-5.6771  -0.3754  -0.2549  -0.1505   3.4325

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)        -2.665e+00  2.198e-01 -12.122  < 2e-16 ***
age                 2.862e-04  2.633e-03   0.109  0.91346
jobblue-collar     -2.465e-01  8.752e-02  -2.817  0.00485 **
jobentrepreneur    -2.497e-01  1.471e-01  -1.698  0.08952 .
jobhousemaid       -3.159e-01  1.582e-01  -1.997  0.04582 *
jobmanagement      -7.378e-02  8.841e-02  -0.835  0.40397
jobretired          2.571e-01  1.167e-01   2.203  0.02759 *
jobself-employed   -2.568e-01  1.343e-01  -1.912  0.05586 .
jobservices        -1.698e-01  1.007e-01  -1.686  0.09180 .
jobstudent          3.429e-01  1.320e-01   2.597  0.00939 **
jobtechnician      -1.166e-01  8.358e-02  -1.395  0.16315
jobunemployed      -4.978e-02  1.333e-01  -0.373  0.70885
jobunknown         -1.998e-01  2.718e-01  -0.735  0.46233
maritalmarried     -1.959e-01  7.002e-02  -2.797  0.00516 **
maritalsingle       8.012e-02  8.019e-02   0.999  0.31779
educationsecondary  1.591e-01  7.670e-02   2.074  0.03809 *
educationtertiary   3.350e-01  8.975e-02   3.733  0.00019 ***
educationunknown    2.744e-01  1.226e-01   2.238  0.02524 *
defaultyes          7.072e-03  1.878e-01   0.038  0.96996
balance             1.348e-05  6.174e-06   2.183  0.02903 *
housingyes         -6.644e-01  5.238e-02 -12.684  < 2e-16 ***
loanyes            -4.697e-01  7.231e-02  -6.495 8.30e-11 ***
contacttelephone   -8.801e-02  8.794e-02  -1.001  0.31689
contactunknown     -1.637e+00  8.774e-02 -18.655  < 2e-16 ***
day                 8.951e-03  2.977e-03   3.006  0.00265 **
monthaug           -6.874e-01  9.428e-02  -7.291 3.09e-13 ***
monthdec            7.854e-01  2.005e-01   3.917 8.96e-05 ***
monthfeb           -8.528e-02  1.060e-01  -0.805  0.42094
monthjan           -1.354e+00  1.526e-01  -8.875  < 2e-16 ***
monthjul           -7.605e-01  9.232e-02  -8.237  < 2e-16 ***
monthjun            4.912e-01  1.122e-01   4.379 1.19e-05 ***
monthmar            1.765e+00  1.416e-01  12.468  < 2e-16 ***
monthmay           -3.817e-01  8.683e-02  -4.396 1.10e-05 ***
monthnov           -7.817e-01  1.002e-01  -7.799 6.25e-15 ***
monthoct            9.693e-01  1.286e-01   7.536 4.83e-14 ***
monthsep            8.515e-01  1.422e-01   5.988 2.12e-09 ***
duration            4.104e-03  7.584e-05  54.122  < 2e-16 ***
campaign           -9.137e-02  1.219e-02  -7.494 6.68e-14 ***
pdays               2.919e-04  3.644e-04   0.801  0.42300
previous            3.424e-02  1.159e-02   2.954  0.00313 **
poutcomeother       8.240e-03  1.093e-01   0.075  0.93991
poutcomesuccess     2.228e+00  9.802e-02  22.729  < 2e-16 ***
poutcomeunknown     2.050e-03  1.145e-01   0.018  0.98572 

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22845  on 31648  degrees of freedom
Residual deviance: 15129  on 31606  degrees of freedom
AIC: 15215

Number of Fisher Scoring iterations: 6

 

Coefficients of the independent variables in the linear equation are given above. Wald’s z-test is performed to bring out which of the coefficients are significantly different from zero. Those coefficients with a dot or one or more stars at the end are significantly different from zero; the others are not. Thus ‘pdays’ and ‘age’ can be safely removed from the model. Further note that among the three poutcome dummies, two of them, poutcomeunknown and poutcomeother, are not significant; only poutcomesuccess is. Is poutcome significant overall? This can be gauged using the Wald test available in the ‘aod’ package.

# Conduct Wald's test on poutcome; its dummies occupy serial numbers 41 to 43
#  in the coefficient table. Terms 41 to 43 are jointly tested
> library(aod)
> wald.test(b = coef(logitModel), Sigma = vcov(logitModel), Terms = 41:43)
Wald test:
----------

Chi-squared test:
X2 = 731.2, df = 3, P(> X2) = 0.0

With a p-value of 0.0, poutcome is significant. A similar test for maritalsingle (non-significant, term 15) and maritalmarried (significant, term 14) yields the result that jointly they are significant. Should you construct another model after removing the non-significant variables, compare the AIC values of the two models. The Akaike information criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data; as such, AIC provides a means for model selection. On its own, the absolute value of AIC is of no use. You can use the stepAIC command available in the MASS library to automatically carry out a restricted search for the “best” model, as measured by AIC. stepAIC accepts both categorical and numerical variables. It does not search all possible subsets of variables, but rather searches in a specific sequence determined by the order of the variables in the formula of the model object.
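Both checks can be scripted as below (a minimal sketch, assuming the aod and MASS packages are installed; reducedModel is our name for the result):

# Jointly test maritalmarried (term 14) and maritalsingle (term 15)
wald.test(b = coef(logitModel), Sigma = vcov(logitModel), Terms = 14:15)
# Let stepAIC search for a lower-AIC model by dropping/adding terms
library(MASS)
reducedModel <- stepAIC(logitModel, direction = "both", trace = FALSE)
# Compare the two models; the lower AIC is preferred
AIC(logitModel, reducedModel)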

Deviance is an important measure of the fit of a model; it is also used to compare models. The deviance of a model is minus 2 times the log likelihood of the data under the model, plus a constant that would be the same for all models on the same data and so can be ignored. The larger the deviance, the worse the fit. The two deviances, that of the null model (i.e. the model with only an intercept term) and that of the model with independent variables, can also be compared to find out if the fit is significantly better; the difference of the two deviances follows a chi-square distribution. Note that the residual deviance of 15129 on 31606 degrees of freedom is far below its degrees of freedom: pchisq() below returns a lower-tail probability of 0, so the upper-tail p-value is essentially 1, indicating no evidence of lack of fit.

> pchisq(15129,df=31606)
[1] 0
> 
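To compare the null and fitted models directly, run a likelihood-ratio test on the difference of the two deviances (a sketch using the components stored in the fitted glm object):

# Difference of deviances and of degrees of freedom
dev_diff <- logitModel$null.deviance - logitModel$deviance   # 22845 - 15129
df_diff  <- logitModel$df.null - logitModel$df.residual      # 31648 - 31606
# Upper-tail chi-square p-value; a very small value means the IVs improve the fit
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)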

The model that we constructed has the form:

log(p/(1-p)) = intercept + c1*x1 + c2*x2 + c3*x3 + ...

The ratio p/(1-p) is the odds. Exponentiating a coefficient gives the odds ratio, i.e. the factor by which the odds are multiplied for a unit increase in the corresponding variable.


# Find odds ratio as:
exp(coef(logitModel))

# Or better still with 95% confidence interval, as:
exp(cbind(OddsRatio=coef(logitModel),confint(logitModel)))

The result is as below:

                    OddsRatio      2.5 %     97.5 %
(Intercept)        0.06961723 0.04519572  0.1069932
age                1.00028622 0.99513022  1.0054570
jobblue-collar     0.78152131 0.65841888  0.9279574
jobentrepreneur    0.77904585 0.58115945  1.0347133
jobhousemaid       0.72911497 0.53184047  0.9892007
jobmanagement      0.92887269 0.78145798  1.1051934
jobretired         1.29321532 1.02837065  1.6251009
jobself-employed   0.77351038 0.59275182  1.0037319
jobservices        0.84381316 0.69202032  1.0271966
jobstudent         1.40906871 1.08643664  1.8231978
jobtechnician      0.88998429 0.75572355  1.0487571
jobunemployed      0.95143826 0.73071041  1.2325361
jobunknown         0.81890624 0.47112825  1.3708220
maritalmarried     0.82212563 0.71735677  0.9439807
maritalsingle      1.08341237 0.92644281  1.2687139
educationsecondary 1.17242630 1.00961551  1.3638226
educationtertiary  1.39792771 1.17313231  1.6678428
educationunknown   1.31578091 1.03286746  1.6707020
defaultyes         1.00709671 0.68672796  1.4360142
balance            1.00001348 1.00000121  1.0000255
housingyes         0.51455711 0.46425005  0.5700844
loanyes            0.62520767 0.54174148  0.7193315
contacttelephone   0.91574909 0.76936952  1.0861102
contactunknown     0.19461557 0.16368937  0.2308932
day                1.00899075 1.00311967  1.0148967
monthaug           0.50289086 0.41809127  0.6050744
monthdec           2.19320466 1.47725368  3.2441840
monthfeb           0.91825609 0.74581870  1.1299667
monthjan           0.25821116 0.19045604  0.3465018
monthjul           0.46745301 0.39008422  0.5602142
monthjun           1.63432029 1.31182006  2.0364832
monthmar           5.84128618 4.42688513  7.7123317
monthmay           0.68267604 0.57604036  0.8096684
monthnov           0.45760929 0.37573926  0.5566459
monthoct           2.63614726 2.04797977  3.3911321
monthsep           2.34312269 1.77218907  3.0948956
duration           1.00411286 1.00396456  1.0042631
campaign           0.91268310 0.89074341  0.9343410
pdays              1.00029198 0.99957596  1.0010053
previous           1.03482952 1.01096152  1.0581795
poutcomeother      1.00827444 0.81254711  1.2474276
poutcomesuccess    9.27986285 7.66483355 11.2562505
poutcomeunknown    1.00205188 0.80113330  1.2553803

It indicates, for example, that for a retired person, relative to the reference job category (‘admin.’), the odds of taking a term deposit are multiplied by a factor of 1.29. Conversely, an odds ratio of less than 1 means the odds decrease relative to the reference. On this, see the Stata FAQ here.
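To read off a single odds ratio from the fitted model (a one-line sketch):

# Odds for 'retired' relative to the reference job category ('admin.')
exp(coef(logitModel)["jobretired"])   # about 1.29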

Recall that we had split the bank dataset into banktrain and banktest (70:30). For constructing the model, i.e. for training purposes, we used the banktrain dataset. We can now test how good our model is by using the banktest dataset and drawing up a confusion matrix.

# Calculate probabilities
# We use only columns 1 to 16 of the banktest dataset, as column 17
# contains the class target, y, to be predicted
# First, the predicted probability of 'yes' for each record
probabilities <- predict.glm(logitModel, banktest[,1:16], type="response")
# Convert probabilities into class values. 
# 0.5 is the threshold value.
predictions <- cut(probabilities, c(-Inf,0.5,Inf), labels=c("no","yes"))
# Draw up confusion matrix
table(predictions, banktest$y)

The confusion matrix is given below:

predictions    no   yes   
        no  11702  1068    
        yes   274   518

There is, therefore, a precision of 91.6% for the ‘no’ predictions (11702 of 12770) and of 65.4% for the ‘yes’ predictions (518 of 792). We can get a similar result using the CVbinary function of the DAAG package, as below:



> library(DAAG)
> CVbinary(logitModel)

Fold:  5 9 1 6 8 7 3 10 2 4
Internal estimate of accuracy = 0.902
Cross-validation estimate of accuracy = 0.902
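Overall accuracy can also be computed directly from the confusion matrix, or obtained along with per-class statistics from caret's confusionMatrix() (a short sketch; the objects are those created in the listing above):

# Overall accuracy from the 2x2 table: (11702 + 518)/13562, i.e. about 0.90
cm <- table(predictions, banktest$y)
sum(diag(cm)) / sum(cm)
# confusionMatrix() reports accuracy, sensitivity, specificity etc. in one call
confusionMatrix(predictions, banktest$y, positive = "yes")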

You can also examine collinearity among variables and eliminate highly correlated variables as below:

# Design matrix without the intercept column
X <- model.matrix(logitModel)[,-1]
# Pairwise correlations among the dummy-coded predictors
round(cor(X), 2)
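With so many dummy-coded columns the full correlation matrix is large; caret's findCorrelation() can flag the columns whose pairwise correlation exceeds a cutoff (a sketch; the 0.8 cutoff is our arbitrary choice):

# Indices of predictor columns with pairwise correlation above the cutoff
highCor <- findCorrelation(cor(X), cutoff = 0.8)
colnames(X)[highCor]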

Some good references for logistic regression using R are here: [1] and [2]

In Part-IV we will use Mahout from the command line to build the logit model.
