This is Part-III of the series. In Part-I we described the data under analysis and in Part-II we used RapidMiner to draw up a logistic regression model. We now use R for the purpose.

R is a command/program oriented framework and therefore has a steeper learning curve than menu-based statistical tools. However, once this initial hurdle is crossed, one is rewarded with a very rich and varied set of statistical processing tools, backed by plenty of documentation. In this Part, we will use R to build the logistic model.

There is no need to transform the bank data set as we did in Part-I (i.e. to convert categorical values into integers). We just have to tell R which variables are categorical, and R will take care of them by itself while summarising the data and while building the model. We will proceed in three steps: in Step 1, we import the data and look at its descriptive statistics; in Step 2, we build the model; and in Step 3, we look at its performance and use the model to make predictions.

# Read semicolon separated data.
# The data is read as a data-frame and assigned to bankdata
bankdata=read.csv("/home/ashokharnal/r/bank-full.csv",sep=";")
# We will partition the data-set in 70:30 ratio
# for training purposes and for validation purposes.
# Import library 'caret'
library(caret)
set.seed(3456)
# Indicate the classification variable
bankdata$y=factor(bankdata$y)
# Split now
trainindex<-createDataPartition(bankdata$y,p=0.7,list=FALSE)
# Get training data-frame
banktrain<-bankdata[trainindex,]
# Get data-frame for testing
banktest<-bankdata[-trainindex,]
# Check number of rows in split data-sets
nrow(banktrain)
nrow(banktest)
# Alternatively, check dimensions of both datasets
dim(banktrain)
dim(banktest)
# Do both data-sets contain column headings? Check.
names(banktrain)
names(banktest)
# Indicate now which variables are categorical variables
banktrain$y=factor(banktrain$y)
banktrain$job=factor(banktrain$job)
banktrain$marital=factor(banktrain$marital)
banktrain$education=factor(banktrain$education)
banktrain$housing=factor(banktrain$housing)
banktrain$loan=factor(banktrain$loan)
banktrain$contact=factor(banktrain$contact)
banktrain$month=factor(banktrain$month)
banktrain$poutcome=factor(banktrain$poutcome)
# Get descriptive stats for data
summary(banktrain)

We have used the *caret* package (Classification And REgression Training) for data splitting. Package 'caret' contains a set of functions to streamline the process of creating predictive models. Function createDataPartition() creates balanced splits of the data: if its first argument is a factor (as in this case), random sampling occurs within each class, which preserves the overall class distribution of the data. A random seed (set.seed) is set so that the split is reproducible. We get a training data-set, banktrain, and a test data-set, banktest. We also indicate above which of the fields are categorical (factor) in nature. The summary output (descriptive statistics) is as follows:

> summary(banktrain)
      age                 job            marital          education
 Min.   :18.00   blue-collar:6786   divorced: 3622   primary  : 4722
 1st Qu.:33.00   management :6613   married :19058   secondary:16298
 Median :39.00   technician :5369   single  : 8969   tertiary : 9329
 Mean   :40.92   admin.     :3602                    unknown  : 1300
 3rd Qu.:48.00   services   :2929
 Max.   :95.00   retired    :1574
                 (Other)    :4776
 default        balance        housing      loan            contact
 no :31076   Min.   : -8019   no :14164   no :26564   cellular :20468
 yes:  573   1st Qu.:    75   yes:17485   yes: 5085   telephone: 2056
             Median :   454                           unknown  : 9125
             Mean   :  1360
             3rd Qu.:  1432
             Max.   :102127
      day            month         duration         campaign
 Min.   : 1.00   may    :9530   Min.   :   0.0   Min.   : 1.000
 1st Qu.: 8.00   jul    :4864   1st Qu.: 103.0   1st Qu.: 1.000
 Median :16.00   aug    :4376   Median : 181.0   Median : 2.000
 Mean   :15.78   jun    :3776   Mean   : 259.9   Mean   : 2.747
 3rd Qu.:21.00   nov    :2776   3rd Qu.: 321.0   3rd Qu.: 3.000
 Max.   :31.00   apr    :2070   Max.   :4918.0   Max.   :55.000
                 (Other):4257
     pdays           previous          poutcome       y
 Min.   : -1.00   Min.   : 0.000   failure: 3410   no :27946
 1st Qu.: -1.00   1st Qu.: 0.000   other  : 1275   yes: 3703
 Median : -1.00   Median : 0.000   success: 1057
 Mean   : 39.82   Mean   : 0.568   unknown:25907
 3rd Qu.: -1.00   3rd Qu.: 0.000
 Max.   :854.00   Max.   :58.000

Note from above that age varies from 18 to 95, but a third quartile of only 48 years implies that younger persons outnumber older ones. Also, the (Other) category of job groups the less frequent categories: 'self-employed', 'entrepreneur', 'housemaid', 'student', 'unemployed' and 'unknown'.

Any of these may be important, as we will shortly see. As to credit in default, only about 2% are defaulters. Since 98% are non-defaulters, this field may be of little use in making predictions. Also note the statistics for the class variable 'y': around 28000 customers have not subscribed to a term deposit and only 3703 have. The model will have to work really hard to find a pattern. This said, let us now proceed with building the model. We will use Generalised Linear Modelling. In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows response variables with error distributions other than the normal distribution. The GLM generalizes linear regression by relating the linear model to the response variable via a link function. The link function in our case is the logit function.

# Use glm function, with y as dependent variable and all others as
# independent variables (IVs), written in the usual manner with '+' sign.
logitModel=glm(y~poutcome+previous+pdays+campaign+duration+month+day+contact+loan+housing+balance+default+education+marital+job+age,
               data=banktrain,family=binomial(logit))
# As all variables besides y are IVs, we could have used a shorter form (with dot), as:
logitModel=glm(y~., data=banktrain,family=binomial(logit))
# Get model summary
summary(logitModel)

The model summary is shown next.

> summary(logitModel)

Call:
glm(formula = y ~ ., family = binomial(logit), data = banktrain)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-5.6771  -0.3754  -0.2549  -0.1505   3.4325

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)        -2.665e+00  2.198e-01 -12.122  < 2e-16 ***
age                 2.862e-04  2.633e-03   0.109  0.91346
jobblue-collar     -2.465e-01  8.752e-02  -2.817  0.00485 **
jobentrepreneur    -2.497e-01  1.471e-01  -1.698  0.08952 .
jobhousemaid       -3.159e-01  1.582e-01  -1.997  0.04582 *
jobmanagement      -7.378e-02  8.841e-02  -0.835  0.40397
jobretired          2.571e-01  1.167e-01   2.203  0.02759 *
jobself-employed   -2.568e-01  1.343e-01  -1.912  0.05586 .
jobservices        -1.698e-01  1.007e-01  -1.686  0.09180 .
jobstudent          3.429e-01  1.320e-01   2.597  0.00939 **
jobtechnician      -1.166e-01  8.358e-02  -1.395  0.16315
jobunemployed      -4.978e-02  1.333e-01  -0.373  0.70885
jobunknown         -1.998e-01  2.718e-01  -0.735  0.46233
maritalmarried     -1.959e-01  7.002e-02  -2.797  0.00516 **
maritalsingle       8.012e-02  8.019e-02   0.999  0.31779
educationsecondary  1.591e-01  7.670e-02   2.074  0.03809 *
educationtertiary   3.350e-01  8.975e-02   3.733  0.00019 ***
educationunknown    2.744e-01  1.226e-01   2.238  0.02524 *
defaultyes          7.072e-03  1.878e-01   0.038  0.96996
balance             1.348e-05  6.174e-06   2.183  0.02903 *
housingyes         -6.644e-01  5.238e-02 -12.684  < 2e-16 ***
loanyes            -4.697e-01  7.231e-02  -6.495 8.30e-11 ***
contacttelephone   -8.801e-02  8.794e-02  -1.001  0.31689
contactunknown     -1.637e+00  8.774e-02 -18.655  < 2e-16 ***
day                 8.951e-03  2.977e-03   3.006  0.00265 **
monthaug           -6.874e-01  9.428e-02  -7.291 3.09e-13 ***
monthdec            7.854e-01  2.005e-01   3.917 8.96e-05 ***
monthfeb           -8.528e-02  1.060e-01  -0.805  0.42094
monthjan           -1.354e+00  1.526e-01  -8.875  < 2e-16 ***
monthjul           -7.605e-01  9.232e-02  -8.237  < 2e-16 ***
monthjun            4.912e-01  1.122e-01   4.379 1.19e-05 ***
monthmar            1.765e+00  1.416e-01  12.468  < 2e-16 ***
monthmay           -3.817e-01  8.683e-02  -4.396 1.10e-05 ***
monthnov           -7.817e-01  1.002e-01  -7.799 6.25e-15 ***
monthoct            9.693e-01  1.286e-01   7.536 4.83e-14 ***
monthsep            8.515e-01  1.422e-01   5.988 2.12e-09 ***
duration            4.104e-03  7.584e-05  54.122  < 2e-16 ***
campaign           -9.137e-02  1.219e-02  -7.494 6.68e-14 ***
pdays               2.919e-04  3.644e-04   0.801  0.42300
previous            3.424e-02  1.159e-02   2.954  0.00313 **
poutcomeother       8.240e-03  1.093e-01   0.075  0.93991
poutcomesuccess     2.228e+00  9.802e-02  22.729  < 2e-16 ***
poutcomeunknown     2.050e-03  1.145e-01   0.018  0.98572
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22845  on 31648  degrees of freedom
Residual deviance: 15129  on 31606  degrees of freedom
AIC: 15215

Number of Fisher Scoring iterations: 6

Coefficients of the independent variables in the linear equation are given above. Wald's z-test is performed to bring out which coefficients are significantly different from zero. Coefficients marked with a dot or one or more stars are significantly different; others are not. Thus 'pdays' and 'age' can be safely removed from the model. Further, note that among the three poutcome levels, two of them, poutcomeunknown and poutcomeother, are not significant; only poutcomesuccess is. Is poutcome significant overall? This can be gauged using the Wald test available in the 'aod' package.

# Conduct Wald's test on poutcome at serial numbers 41 to 43.
# Coefficients (Terms) 41 to 43 in the coefficient table are jointly tested
> library(aod)
> wald.test(b = coef(logitModel), Sigma = vcov(logitModel), Terms = 41:43)
Wald test:
----------
Chi-squared test:
X2 = 731.2, df = 3, P(> X2) = 0.0

With a p-value of 0.0, poutcome is significant. A similar test for maritalsingle (non-significant, term: 15) and maritalmarried (significant, term: 14) yields the result that jointly they are significant. Should you construct another model after removing the non-significant variables, note the AIC value then. The Akaike information criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data; as such, AIC provides a means for model selection. On its own, the absolute value of AIC is of no use. You can use the stepAIC command available in the MASS library to automatically carry out a restricted search for the "best" model, as measured by AIC. stepAIC accepts both categorical and numerical variables. It does not search all possible subsets of variables, but rather searches in a specific sequence determined by the order of the variables in the formula of the model object.
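As a sketch of how such a search might be run on our model (continuing the session above; the choice of direction="both" is illustrative):

# Automated model selection by AIC using MASS::stepAIC
library(MASS)
# Starting from the full model, add/drop terms step by step,
# keeping the model with the lowest AIC at each step
reducedModel <- stepAIC(logitModel, direction="both")
summary(reducedModel)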

Deviance is an important measure of the fit of a model, and is also used to compare models. The deviance of a model is minus 2 times the log likelihood of the data under the model, plus a constant that is the same for all models fitted to the same data and so can be ignored. The larger the deviance, the worse the fit. The two deviances, that of the null model (i.e. the model with only an intercept term) and that of the model with independent variables, can also be compared to find out whether the fit is significantly better; the difference of deviances follows a chi-square distribution. The residual deviance of 15129 on 31606 degrees of freedom has a lower-tail chi-square probability of 0, as can be seen below: the deviance is far below its degrees of freedom, indicating a good model fit.

> pchisq(15129,df=31606)
[1] 0
>
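The null-versus-fitted comparison described above can be carried out as a likelihood-ratio test; a sketch, using the deviances and degrees of freedom stored in the fitted glm object:

# Drop in deviance from null to fitted model (22845 - 15129),
# tested on the difference in degrees of freedom (31648 - 31606)
with(logitModel, pchisq(null.deviance - deviance,
                        df.null - df.residual, lower.tail=FALSE))

A p-value near zero here would mean the model with predictors fits significantly better than the intercept-only model.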

The model that we constructed has the form:

log(p/(1-p)) = intercept + c1*x1 + c2*x2 + c3*x3 + …

The ratio p/(1-p) is the odds. Exponentiating a coefficient gives the corresponding odds ratio, i.e. the multiplicative change in the odds for a unit increase in that variable.

# Find odds ratios as:
exp(coef(logitModel))
# Or better still with 95% confidence intervals, as:
exp(cbind(OddsRatio=coef(logitModel),confint(logitModel)))

The result is as below:

                    OddsRatio      2.5 %     97.5 %
(Intercept)        0.06961723 0.04519572  0.1069932
age                1.00028622 0.99513022  1.0054570
jobblue-collar     0.78152131 0.65841888  0.9279574
jobentrepreneur    0.77904585 0.58115945  1.0347133
jobhousemaid       0.72911497 0.53184047  0.9892007
jobmanagement      0.92887269 0.78145798  1.1051934
jobretired         1.29321532 1.02837065  1.6251009
jobself-employed   0.77351038 0.59275182  1.0037319
jobservices        0.84381316 0.69202032  1.0271966
jobstudent         1.40906871 1.08643664  1.8231978
jobtechnician      0.88998429 0.75572355  1.0487571
jobunemployed      0.95143826 0.73071041  1.2325361
jobunknown         0.81890624 0.47112825  1.3708220
maritalmarried     0.82212563 0.71735677  0.9439807
maritalsingle      1.08341237 0.92644281  1.2687139
educationsecondary 1.17242630 1.00961551  1.3638226
educationtertiary  1.39792771 1.17313231  1.6678428
educationunknown   1.31578091 1.03286746  1.6707020
defaultyes         1.00709671 0.68672796  1.4360142
balance            1.00001348 1.00000121  1.0000255
housingyes         0.51455711 0.46425005  0.5700844
loanyes            0.62520767 0.54174148  0.7193315
contacttelephone   0.91574909 0.76936952  1.0861102
contactunknown     0.19461557 0.16368937  0.2308932
day                1.00899075 1.00311967  1.0148967
monthaug           0.50289086 0.41809127  0.6050744
monthdec           2.19320466 1.47725368  3.2441840
monthfeb           0.91825609 0.74581870  1.1299667
monthjan           0.25821116 0.19045604  0.3465018
monthjul           0.46745301 0.39008422  0.5602142
monthjun           1.63432029 1.31182006  2.0364832
monthmar           5.84128618 4.42688513  7.7123317
monthmay           0.68267604 0.57604036  0.8096684
monthnov           0.45760929 0.37573926  0.5566459
monthoct           2.63614726 2.04797977  3.3911321
monthsep           2.34312269 1.77218907  3.0948956
duration           1.00411286 1.00396456  1.0042631
campaign           0.91268310 0.89074341  0.9343410
pdays              1.00029198 0.99957596  1.0010053
previous           1.03482952 1.01096152  1.0581795
poutcomeother      1.00827444 0.81254711  1.2474276
poutcomesuccess    9.27986285 7.66483355 11.2562505
poutcomeunknown    1.00205188 0.80113330  1.2553803

It indicates, for example, that for retired persons (relative to the base job category) the odds that a term deposit would be taken are higher by a factor of 1.29. Conversely, an odds ratio of less than 1 means the odds decrease; its reciprocal gives the odds of failure. On this, see the Stata FAQ here.
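To make the reciprocal reading concrete, here is a small sketch using values from the table above (the 0.5146 figure is the housingyes odds ratio):

# Odds ratio for jobretired, straight from the model (about 1.29)
exp(coef(logitModel)["jobretired"])
# An odds ratio below 1 is easier to read when inverted: for housingyes,
# 1/0.5146 means the odds against subscribing are about 1.94 times higher
# for customers with a housing loan
1/0.51455711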

Recall that we had split the bank dataset into banktrain and banktest (70:30). For constructing the model, i.e. for training purposes, we used the banktrain dataset. We can now test how good our model is by using the banktest dataset and drawing up a confusion matrix.

# Calculate probabilities
# We use only columns 1 to 16 of banktest dataset as column 17
# contains the class target, y, to be predicted
# First, probabilities of each event
probabilities <- predict.glm(logitModel, banktest[,1:16], type="response")
# Convert probabilities into class values.
# 0.5 is the threshold value.
predictions <- cut(probabilities, c(-Inf,0.5,Inf), labels=c("no","yes"))
# Draw up confusion matrix
table(predictions, banktest$y)

The confusion matrix is given below:

predictions    no   yes
        no  11702  1068
        yes   274   518
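The per-class rates discussed in the text can be read straight off this table; a quick sketch, continuing the session above:

# Row-wise rates: share of correct predictions within each predicted class
cm <- table(predictions, banktest$y)
prop.table(cm, 1)       # e.g. 11702/(11702+1068) for the 'no' row
# Overall accuracy: about 0.90
sum(diag(cm))/sum(cm)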

There is, therefore, a prediction accuracy of 91.6% (11702 out of 12770) among the predicted 'no's and 65.4% (518 out of 792) among the predicted 'yes's. We can get a similar result using the CVbinary function of the DAAG package, as below:

> library(DAAG)
> CVbinary(logitModel)

Fold:  5 9 1 6 8 7 3 10 2 4
Internal estimate of accuracy = 0.902
Cross-validation estimate of accuracy = 0.902

You can also examine collinearity among the variables and eliminate highly correlated ones, as below:

X<-model.matrix(logitModel)[,-1]
round(cor(X),2)
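Scanning a large correlation matrix by eye is tedious; caret's findCorrelation() can flag the offending columns automatically. A sketch (the 0.8 cutoff is illustrative, not a recommendation):

# Flag columns whose pairwise correlation exceeds the cutoff
library(caret)
highCorr <- findCorrelation(cor(X), cutoff=0.8)
colnames(X)[highCorr]   # names of candidate columns to drop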

Some good references for logistic regression using R are here: [1] and [2]

In Part-IV we will use Mahout from the command line to build a logit model.