
Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part III

March 10, 2014

This is Part-III of the series. In Part-I we described the data under analysis, and in Part-II we used RapidMiner to draw up a logistic regression model. We now use R for the same purpose.

R is a command/program-oriented framework and therefore has a steeper learning curve than menu-based statistical tools. Once the initial hurdle is crossed, however, one is rewarded with a very rich set of statistical processing tools backed by plenty of documentation. In this part we use R to build the logistic model.

There is no need to transform the bank data set as we did in Part-I (i.e. to recode categorical values as integers). We only have to tell R which variables are categorical, and R will handle them itself while summarising the data and while building the model. We proceed in three steps. In Step 1, we import the data and look at its descriptive statistics. In Step 2, we build the model, and in Step 3, we examine its performance and use the model to make predictions.

# Read semicolon separated data.
# The data is read as a data-frame and assigned to bankdata
bankdata=read.csv("/home/ashokharnal/r/bank-full.csv",sep=";")
# We will partition the data-set in 70:30 ratio
# for training purposes and for validation purposes.
# Import library 'caret'
library(caret)
set.seed(3456)
# Indicate the classification variable
bankdata$y=factor(bankdata$y)
# Split now
trainindex<-createDataPartition(bankdata$y,p=0.7,list=FALSE)
# Get training data-frame
banktrain<-bankdata[trainindex,]
# Get data-frame for testing
banktest<-bankdata[-trainindex,]
# Check number of rows in split-data set
nrow(banktrain)
nrow(banktest)
# Alternatively, check dimensions of both datasets
dim(banktrain)
dim(banktest)
# Check that both data-sets contain the column headings
names(banktrain)
names(banktest)
# Indicate now which variables are categorical variables
banktrain$y=factor(banktrain$y)
banktrain$job=factor(banktrain$job)
banktrain$marital=factor(banktrain$marital)
banktrain$education=factor(banktrain$education)
banktrain$housing=factor(banktrain$housing)
banktrain$loan=factor(banktrain$loan)
banktrain$contact=factor(banktrain$contact)
banktrain$month=factor(banktrain$month)
banktrain$poutcome=factor(banktrain$poutcome)
# Get descriptive stats for data
summary(banktrain)
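
Before looking at the summary, a quick optional check (a sketch, not part of the original listing) confirms that createDataPartition has preserved the class proportions of y in the training split:

# Class proportions in the full data and in the training partition
# should be nearly identical
prop.table(table(bankdata$y))
prop.table(table(banktrain$y))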

We have used the caret package (Classification And REgression Training) for data splitting. Package ‘caret’ contains a set of functions that streamline the process of creating predictive models. The function createDataPartition() creates balanced splits of the data: if its first argument is a factor (as in this case), random sampling is done within each class, so the overall class distribution of the data is preserved in both partitions. We set a random seed (set.seed) so that the split is reproducible. We obtain a training data set, banktrain, and a test data set, banktest. We also indicate above which of the fields are categorical (factor) in nature. The summary output (descriptive statistics) is as follows:

> summary(banktrain)

      age                 job           marital          education
 Min.   :18.00   blue-collar:6786   divorced: 3622   primary  : 4722
 1st Qu.:33.00   management :6613   married :19058   secondary:16298
 Median :39.00   technician :5369   single  : 8969   tertiary : 9329
 Mean   :40.92   admin.     :3602                    unknown  : 1300
 3rd Qu.:48.00   services   :2929
 Max.   :95.00   retired    :1574
                 (Other)    :4776

 default        balance       housing      loan            contact
 no :31076   Min.   : -8019   no :14164   no :26564   cellular :20468
 yes:  573   1st Qu.:    75   yes:17485   yes: 5085   telephone: 2056
             Median :   454                           unknown  : 9125
             Mean   :  1360
             3rd Qu.:  1432
             Max.   :102127

      day            month         duration         campaign
 Min.   : 1.00   may    :9530   Min.   :   0.0   Min.   : 1.000
 1st Qu.: 8.00   jul    :4864   1st Qu.: 103.0   1st Qu.: 1.000
 Median :16.00   aug    :4376   Median : 181.0   Median : 2.000
 Mean   :15.78   jun    :3776   Mean   : 259.9   Mean   : 2.747
 3rd Qu.:21.00   nov    :2776   3rd Qu.: 321.0   3rd Qu.: 3.000
 Max.   :31.00   apr    :2070   Max.   :4918.0   Max.   :55.000
                 (Other):4257

     pdays           previous         poutcome       y
 Min.   : -1.00   Min.   : 0.000   failure: 3410   no :27946
 1st Qu.: -1.00   1st Qu.: 0.000   other  : 1275   yes: 3703
 Median : -1.00   Median : 0.000   success: 1057
 Mean   : 39.82   Mean   : 0.568   unknown:25907
 3rd Qu.: -1.00   3rd Qu.: 0.000
 Max.   :854.00   Max.   :58.000

Note from the above that age ranges from 18 to 95, but a third quartile of 48 years implies that younger persons outnumber older ones. Also, the (Other) category of job groups the categories not listed separately: ‘student’, ‘self-employed’, ‘unemployed’, ‘housemaid’, ‘entrepreneur’ and ‘unknown’. Any of these may be important, as we will shortly see. As to credit in default, only about 2% are defaulters; since around 98% are non-defaulters, this variable may be of little use in making predictions. Also note the statistics for the class variable ‘y’: around 28000 clients have not subscribed to a term deposit and only 3703 have subscribed, so the model will have to work hard to find a pattern. This said, let us now proceed with building the model. We will use Generalised Linear Modelling. In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows response variables with error distributions other than the normal distribution. The GLM generalizes linear regression by relating the linear predictor to the response variable through a link function; the link function in our case is the logit function.

# Use glm function, with y as dependent variable and all others as independent variables (IVs), written
#  in the usual manner with '+' sign.
logitModel=glm(y~poutcome+previous+pdays+campaign+duration+month+day+contact+loan+housing+balance+default+education+marital+job+age, data=banktrain,family=binomial(logit))
# As all variables besides y are IVs, we could have used a shorter form (with a dot), as:
logitModel=glm(y~., data=banktrain,family=binomial(logit))
# Get model summary
summary(logitModel)

The model summary is shown next.

> summary(logitModel)

Call:
glm(formula = y ~ ., family = binomial(logit), data = banktrain)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-5.6771  -0.3754  -0.2549  -0.1505   3.4325

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)        -2.665e+00  2.198e-01 -12.122  < 2e-16 ***
age                 2.862e-04  2.633e-03   0.109  0.91346
jobblue-collar     -2.465e-01  8.752e-02  -2.817  0.00485 **
jobentrepreneur    -2.497e-01  1.471e-01  -1.698  0.08952 .
jobhousemaid       -3.159e-01  1.582e-01  -1.997  0.04582 *
jobmanagement      -7.378e-02  8.841e-02  -0.835  0.40397
jobretired          2.571e-01  1.167e-01   2.203  0.02759 *
jobself-employed   -2.568e-01  1.343e-01  -1.912  0.05586 .
jobservices        -1.698e-01  1.007e-01  -1.686  0.09180 .
jobstudent          3.429e-01  1.320e-01   2.597  0.00939 **
jobtechnician      -1.166e-01  8.358e-02  -1.395  0.16315
jobunemployed      -4.978e-02  1.333e-01  -0.373  0.70885
jobunknown         -1.998e-01  2.718e-01  -0.735  0.46233
maritalmarried     -1.959e-01  7.002e-02  -2.797  0.00516 **
maritalsingle       8.012e-02  8.019e-02   0.999  0.31779
educationsecondary  1.591e-01  7.670e-02   2.074  0.03809 *
educationtertiary   3.350e-01  8.975e-02   3.733  0.00019 ***
educationunknown    2.744e-01  1.226e-01   2.238  0.02524 *
defaultyes          7.072e-03  1.878e-01   0.038  0.96996
balance             1.348e-05  6.174e-06   2.183  0.02903 *
housingyes         -6.644e-01  5.238e-02 -12.684  < 2e-16 ***
loanyes            -4.697e-01  7.231e-02  -6.495 8.30e-11 ***
contacttelephone   -8.801e-02  8.794e-02  -1.001  0.31689
contactunknown     -1.637e+00  8.774e-02 -18.655  < 2e-16 ***
day                 8.951e-03  2.977e-03   3.006  0.00265 **
monthaug           -6.874e-01  9.428e-02  -7.291 3.09e-13 ***
monthdec            7.854e-01  2.005e-01   3.917 8.96e-05 ***
monthfeb           -8.528e-02  1.060e-01  -0.805  0.42094
monthjan           -1.354e+00  1.526e-01  -8.875  < 2e-16 ***
monthjul           -7.605e-01  9.232e-02  -8.237  < 2e-16 ***
monthjun            4.912e-01  1.122e-01   4.379 1.19e-05 ***
monthmar            1.765e+00  1.416e-01  12.468  < 2e-16 ***
monthmay           -3.817e-01  8.683e-02  -4.396 1.10e-05 ***
monthnov           -7.817e-01  1.002e-01  -7.799 6.25e-15 ***
monthoct            9.693e-01  1.286e-01   7.536 4.83e-14 ***
monthsep            8.515e-01  1.422e-01   5.988 2.12e-09 ***
duration            4.104e-03  7.584e-05  54.122  < 2e-16 ***
campaign           -9.137e-02  1.219e-02  -7.494 6.68e-14 ***
pdays               2.919e-04  3.644e-04   0.801  0.42300
previous            3.424e-02  1.159e-02   2.954  0.00313 **
poutcomeother       8.240e-03  1.093e-01   0.075  0.93991
poutcomesuccess     2.228e+00  9.802e-02  22.729  < 2e-16 ***
poutcomeunknown     2.050e-03  1.145e-01   0.018  0.98572 

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)

 Null deviance: 22845  on 31648  degrees of freedom
 Residual deviance: 15129  on 31606  degrees of freedom

 AIC: 15215
 Number of Fisher Scoring iterations: 6

 

The coefficients of the independent variables in the linear predictor are given above. Wald’s z-test is performed to show which coefficients are significantly different from zero. Coefficients marked with a dot or one or more stars are significantly different; the others are not. Thus ‘pdays’ and ‘age’ can be safely removed from the model. Further note that among the three poutcome dummies, two of them, poutcomeunknown and poutcomeother, are not significant; only poutcomesuccess is. Is poutcome significant overall? This can be gauged using the Wald test available in the ‘aod’ package.

# Conduct Wald's test on poutcome: its three coefficients are terms 41 to 43
#  in the coefficient table and are tested jointly
> library(aod)
> wald.test(b = coef(logitModel), Sigma = vcov(logitModel), Terms = 41:43)
Wald test:
----------

Chi-squared test:
X2 = 731.2, df = 3, P(> X2) = 0.0

With a p-value of 0.0, poutcome is significant. A similar test for maritalsingle (non-significant, term 15) and maritalmarried (significant, term 14) yields the result that jointly they are significant. Should you construct another model after removing the non-significant variables, note the AIC value. The Akaike information criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data and so provides a means for model selection; on its own, the absolute value of AIC is of no use. You can use the stepAIC command available in the MASS library to automatically carry out a restricted search for the “best” model, as measured by AIC. stepAIC accepts both categorical and numerical variables. It does not search all possible subsets of variables; rather, it searches in a specific sequence determined by the order of the variables in the formula of the model object.
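
As a sketch of how this could proceed (update() and stepAIC are standard R/MASS functions; the object names follow the model fitted above), the clearly non-significant predictors can be dropped and a stepwise AIC search run:

# Refit after dropping the two non-significant predictors noted earlier
logitModel2 <- update(logitModel, . ~ . - age - pdays)
summary(logitModel2)
# Or let stepAIC search for a lower-AIC model (library MASS)
library(MASS)
stepModel <- stepAIC(logitModel, direction = "both", trace = FALSE)
stepModel$anova   # record of the steps taken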

Deviance is an important measure of the fit of a model and is also used to compare models. The deviance of a model is minus 2 times the log-likelihood of the data under the model, plus a constant that is the same for all models fitted to the same data and so can be ignored. The larger the deviance, the worse the fit. The deviance of the null model (the model with only an intercept term) can also be compared with the deviance of the model with independent variables to find out whether the fit is significantly better; the difference in deviances follows a chi-square distribution. The residual deviance of 15129 on 31606 degrees of freedom is well below its degrees of freedom: its lower-tail chi-square probability is essentially 0 (equivalently, the upper-tail p-value is close to 1), as can be seen below, so there is no evidence of lack of fit.

> pchisq(15129,df=31606)
[1] 0
> 
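
More informative than the absolute residual deviance is its drop from the null deviance (22845 - 15129 = 7716 on 31648 - 31606 = 42 degrees of freedom). A quick likelihood-ratio test of the fitted model against the null model can be run as below, a standard idiom using the components stored in the glm object:

# Chi-square test of the reduction in deviance (null model vs fitted model)
with(logitModel, pchisq(null.deviance - deviance,
                        df.null - df.residual, lower.tail = FALSE))
# A p-value near 0 says the predictors jointly improve the fit significantly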

The model that we constructed has the form:

log(p/(1-p)) = intercept + c1*x1 + c2*x2 + c3*x3 + …

The ratio p/(1-p) is the odds. Exponentiating a coefficient gives the odds ratio associated with a one-unit increase in the corresponding variable.


# Find odds ratio as:
exp(coef(logitModel))

# Or better still with 95% confidence interval, as:
exp(cbind(OddsRatio=coef(logitModel),confint(logitModel)))

The result is as below:

                    OddsRatio      2.5 %     97.5 %
(Intercept)        0.06961723 0.04519572  0.1069932
age                1.00028622 0.99513022  1.0054570
jobblue-collar     0.78152131 0.65841888  0.9279574
jobentrepreneur    0.77904585 0.58115945  1.0347133
jobhousemaid       0.72911497 0.53184047  0.9892007
jobmanagement      0.92887269 0.78145798  1.1051934
jobretired         1.29321532 1.02837065  1.6251009
jobself-employed   0.77351038 0.59275182  1.0037319
jobservices        0.84381316 0.69202032  1.0271966
jobstudent         1.40906871 1.08643664  1.8231978
jobtechnician      0.88998429 0.75572355  1.0487571
jobunemployed      0.95143826 0.73071041  1.2325361
jobunknown         0.81890624 0.47112825  1.3708220
maritalmarried     0.82212563 0.71735677  0.9439807
maritalsingle      1.08341237 0.92644281  1.2687139
educationsecondary 1.17242630 1.00961551  1.3638226
educationtertiary  1.39792771 1.17313231  1.6678428
educationunknown   1.31578091 1.03286746  1.6707020
defaultyes         1.00709671 0.68672796  1.4360142
balance            1.00001348 1.00000121  1.0000255
housingyes         0.51455711 0.46425005  0.5700844
loanyes            0.62520767 0.54174148  0.7193315
contacttelephone   0.91574909 0.76936952  1.0861102
contactunknown     0.19461557 0.16368937  0.2308932
day                1.00899075 1.00311967  1.0148967
monthaug           0.50289086 0.41809127  0.6050744
monthdec           2.19320466 1.47725368  3.2441840
monthfeb           0.91825609 0.74581870  1.1299667
monthjan           0.25821116 0.19045604  0.3465018
monthjul           0.46745301 0.39008422  0.5602142
monthjun           1.63432029 1.31182006  2.0364832
monthmar           5.84128618 4.42688513  7.7123317
monthmay           0.68267604 0.57604036  0.8096684
monthnov           0.45760929 0.37573926  0.5566459
monthoct           2.63614726 2.04797977  3.3911321
monthsep           2.34312269 1.77218907  3.0948956
duration           1.00411286 1.00396456  1.0042631
campaign           0.91268310 0.89074341  0.9343410
pdays              1.00029198 0.99957596  1.0010053
previous           1.03482952 1.01096152  1.0581795
poutcomeother      1.00827444 0.81254711  1.2474276
poutcomesuccess    9.27986285 7.66483355 11.2562505
poutcomeunknown    1.00205188 0.80113330  1.2553803

It indicates, for example, that a retired client has odds of taking a term deposit about 1.29 times those of a client in the reference job category. Conversely, an odds ratio of less than 1 means the odds of subscription are lower than for the reference level (or decrease with a unit increase in the variable). On this see the Stata FAQ here.
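
For a continuous variable the odds ratio refers to a one-unit increase, so it can be rescaled to a more natural unit. For instance, duration is recorded in seconds; a rough illustration of the odds ratio for one extra minute of conversation, using the fitted coefficient, is:

# Odds ratio for a 60-second increase in call duration
exp(60 * coef(logitModel)["duration"])
# about 1.28: each additional minute on the phone multiplies the odds
# of a subscription by roughly 1.28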

Recall that we split the bank dataset into banktrain and banktest in a 70:30 ratio. For constructing the model, i.e. for training purposes, we used the banktrain dataset. We can now test how good our model is by using the banktest dataset and drawing up a confusion matrix.

# Calculate probabilities
# We use only columns 1 to 16 of banktest dataset as column 17
# contains the class target, y, to be predicted
# First probabilities of each event
probabilities <- predict.glm(logitModel, banktest[,1:16], type="response")
# Convert probabilities into class values. 
# 0.5 is the threshold value.
predictions <- cut(probabilities, c(-Inf,0.5,Inf), labels=c("no","yes"))
# Draw up confusion matrix
table(predictions, banktest$y)

The confusion matrix is given below:

predictions    no   yes   
        no  11702  1068    
        yes   274   518

Of the records predicted as ‘no’, 91.6% were actually ‘no’, and of the records predicted as ‘yes’, 65.4% were actually ‘yes’. We can get a similar overall result using the CVbinary function of the DAAG package, as below:



> library(DAAG)
> CVbinary(logitModel)

Fold:  5 9 1 6 8 7 3 10 2 4
Internal estimate of accuracy = 0.902
Cross-validation estimate of accuracy = 0.902
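
The overall accuracy implied by the confusion matrix drawn earlier is consistent with this estimate; it, and the per-class figures quoted above, can be computed directly from the table:

# Rebuild the confusion matrix and derive accuracies from it
cm <- table(predictions, banktest$y)
# Overall accuracy: fraction of test records classified correctly (about 0.90)
sum(diag(cm)) / sum(cm)
# Row-wise proportions: accuracy of 'no' predictions and of 'yes' predictions
prop.table(cm, 1)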

You can also examine collinearity among variables and eliminate highly correlated variables as below:

X<-model.matrix(logitModel)[,-1]
round(cor(X),2)
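
Since caret is already loaded, its findCorrelation function offers a short-cut (a sketch, with an assumed cut-off of 0.8) for flagging model-matrix columns with high pairwise correlation:

# Indices of columns whose absolute pairwise correlation exceeds 0.8
highCor <- findCorrelation(cor(X), cutoff = 0.8)
colnames(X)[highCor]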

Some good references for logistic regression using R are here: [1] and [2]

In Part-IV we will use Mahout from the command line to build the logit model.

Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part II

March 6, 2014

In Part-I of this blog we described the data set for logistic modelling. In this part, we first set up RapidMiner Studio for the experiment and then discuss the results. RapidMiner Studio 6 can be downloaded from here. The Starter and (open-source) community versions of RapidMiner (rapidminer 5.3.015) are free; their limitation is that the complete data set must fit in memory for analysis. Being Java based, RapidMiner runs on either Windows or Linux. There are no installation steps: declare JAVA_HOME, download and unzip the package, and it is ready for work.
Inside the scripts folder, look for the file RapidMinerGUI. There are two of them; select the one with the .sh or .bat extension as appropriate for your OS and double-click it to start RapidMiner. In what follows, some familiarity with RapidMiner operators will be helpful.

Using the bash script given in Part-I, the file bank-full.csv was split into two files: around 4000 randomly selected records were stored in tovalidate.csv, and training.csv was left with the remaining (roughly 41000) records.

Setting up RapidMiner

In RapidMiner we will:
a. Import files training.csv and tovalidate.csv
b. Given training data (training.csv), we will build a logistic regression model. (Operator: Logistic Regression)
c. Training data will be split in 70:30 ratio. 70% of records go into building model, and 30% records are used to gauge model’s performance. (Operators: Split-validation; Performance; Apply model)
d. At the same time, we use this model to classify hitherto unclassified data (the data in the tovalidate.csv file). (Operator: Apply Model)

Import the data file training.csv into the RapidMiner repository as shown in the figures below.


Figure–I: In the lower-left repository panel of RapidMiner, click on the down-arrow to begin import of CSV file

While importing the data, change the ‘attribute’ role of the ‘y’ data-field to ‘label’. Field ‘y’ is the classification field; it records whether our efforts to obtain the desired term deposit from the client were successful. It is the target field of interest: our aim is to predict which future customers are likely to ‘convert’ and make a term deposit with us. Change its data-type to binominal (that is RapidMiner’s spelling, not binomial). For the rest of the fields, the default attributes and data-type values selected by the import wizard are fine (see figure below).


Figure-II: For last field ‘y’, change attribute to label and data-type to binominal instead of integer (click image to have a clearer view)

Similarly, import tovalidate.csv. However, as we are interested in making predictions about the ‘y’ field in this file, do not import the ‘y’ field (deselect it, as below).


Figure-III: While importing tovalidate.csv file, de-select the last (target) field, ‘y’.  (Click image to have a clearer view)

Drag the just-imported data source, training.csv, from the RapidMiner repository into the middle window (the Process window). Search for the ‘Split Validation‘ operator (Evaluation–>Validation–>Split Validation) in the top-left ‘Operators’ box and drag it into the Process window.


Figure–IV: Split-Validation operator. Drag it into Process window.

Connect the two operators as shown in Figure-V (out port is connected to tra port).


Figure-V: The operators are: Split-validation and Apply Model. The data sources are training.csv and tovalidate.csv (click image to have a clearer view)

Next, double-click the lower-right icon (two-overlapping squares) within the Split-validation operator to open another Validation window where model building and validation actions will be performed (See below).


Figure-VI: Split-validation window. Drag Logistic regression operator, Apply Model operator and Performance operator to the two parts of window as shown. (click image to enlarge image).

Search for Logistic Regression operator and drag it into the left-panel. Drag Apply Model ( Modeling->Model Application->Apply Model) and Performance (Evaluation->Performance) operators to right panel. Connect operators as shown. Shift back to Process window (by clicking on Process just above the left panel). From the repository (Figure I), drag imported tovalidate.csv data source and also (from Operators window) drag Apply Model operator into the center Process window. Complete all port connections as shown in Figure V.

What we have done is this: in the Process window, the Split-validation operator splits the incoming data stream (from training.csv) in a 70:30 ratio. It uses the 70% stream to build a logistic regression model (Figure VI, left panel) and applies this model to the remaining 30% to evaluate its performance (Figure VI, right panel). All this model building and testing happens within the Split Validation operator itself. The operator has two outputs: the constructed logit model (mod output port) and the performance evaluation (ave output port). The mod output is fed into the mod input port of Apply Model; the other input port of Apply Model receives data from the tovalidate data source. Apply Model‘s two output ports, one with the labelled (classified) data (lab port) and the other with the model itself (mod port), are connected to the inputs of the Results window. Similarly, the ave port of the Split-validation operator is connected to the Results window.

Model Results

Run the setup now. Model building and testing is quite time consuming. On my i7, 8GB laptop, it took 40 minutes. So be patient. There is a whole lot of output in the Results window. To be brief, the logistic model is as follows:


Figure VII: The logistic model. The model is a linear relationship between the log of the odds and the weighted independent variables.

And the confusion matrix is as below. The confusion matrix shows how many predicted values matched the actual values in the validation test performed by the Split-validation operator. For example, among records predicted to have y classified as 1, 10724 predictions were correct and 1205 were incorrect. The confusion matrix shows that for prediction of y=1 the model’s accuracy is 90%, and for prediction of y=2 it is 66%.


Figure VIII: Confusion matrix (click image to enlarge)

Which measure of precision should we rely on, and why is there so much difference between them? The answer to the first question depends upon which prediction is more important to you, y=2 or y=1; prediction of y=1 is more accurate than prediction of y=2. The answer to the second question lies in the number of training cases with y=2: the number of such cases (1490) is much smaller than the number of training cases with the y=1 classification (10873). The model therefore over-fits the y=1 cases, i.e. it even tries to fit the noise that appears in the y=1 records, while it does not get enough training records for the y=2 class. Possibly, by tuning the ‘C’ parameter of the Logistic Regression operator (for example, reducing it from 1 to 0.5) we can tell the model to over-fit less and generalise more. The ROC curve is given below. The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate for the different possible cut-off points of a validation test. An area under the ROC curve (AUC) of 0.5 is worthless, meaning the model has learnt nothing, while an area of 1 means a perfect model. An AUC value of 0.856 (nearer to 1) is indicative of a good fit.


Figure IX: AUC (area under the ROC curve) is 0.856. This is indicative of a good model.
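
The same ROC/AUC check can also be done in R on the model we build in Part-III; a minimal sketch, assuming the pROC package is installed and using the test labels (banktest$y) and predicted probabilities (probabilities) from that part:

# ROC curve and AUC in R (pROC package), as an illustration of the same idea
library(pROC)
rocObj <- roc(banktest$y, probabilities)
auc(rocObj)    # area under the ROC curve
plot(rocObj)   # draw the ROC curve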

Recall that in the main Process window we are also passing the as-yet-unclassified data (tovalidate.csv minus the y data-field) through the logistic model. So what predictions does it make for this data? RapidMiner reports the predictions as in the figure below.


Figure X: The green coloured column is the predicted y. The prediction is based upon the calculations in the columns Confidence(1), i.e. confidence in y being 1, and Confidence(2), i.e. confidence in y being 2. (Click image to enlarge for clarity)

In Part-III of this series we will build the model using R.

Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part I

March 5, 2014

Logistic regression is a multivariate statistical technique often used in predictive analysis where the dependent variable to be predicted is dichotomous and depends upon a number of independent variables, which may be continuous or categorical. Logistic regression therefore differs from ordinary (linear) regression: in linear regression the dependent variable must be continuous, whereas in logistic regression it must be categorical. In this respect logistic regression and discriminant analysis deal with the same kind of problems and have the same objective; however, the assumptions of discriminant analysis are more stringent than those of logistic regression.
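Formally, if p is the probability of the positive outcome, logistic regression models the log-odds of p as a linear function of the predictors (the same relationship is written out again in Part-III):

\[
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
\]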
Logistic regression is very popular in epidemiological studies and in marketing. It answers such questions as: given certain body metrics, how likely is it that a patient has, or is susceptible to, a certain disease? It is also widely used in computing the TRISS (Trauma and Injury Severity Score) to predict mortality among severely injured patients. In marketing, it can answer such questions as which steps are more likely to retain customers, how customers’ propensity to purchase can be enhanced, or which customers are likely to default on home mortgage loans. In this four-part series, we will use three open-source tools, RapidMiner Studio, R and Mahout (command line), to perform logistic regression on a set of marketing data.
Problem and the data: We will make use of the Bank Marketing Data Set available at the UCI Machine Learning Repository. The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls; often more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed or not. The data has been analysed by S. Moro, R. Laureano and P. Cortez using three techniques: Naïve Bayes, Decision Trees and Support Vector Machines (SVM). A logistic regression model, however, was not constructed; we will do that now. But first the data. The data description is as below:

 

Data heading : Description
age : numeric
job : type of job (categorical: admin., unknown, unemployed, management, housemaid, entrepreneur, student, blue-collar, self-employed, retired, technician, services)
marital : marital status (married, divorced, single)
education : unknown, secondary, primary, tertiary
default : has credit in default (yes, no)
balance : numeric
housing : has housing loan (yes, no)
loan : has personal loan (yes, no)
contact : contact communication type (unknown, telephone, cellular)
day : last contact day of the month (numeric)
month : last contact month of year (jan, feb, mar, …, nov, dec)
duration : last contact duration, in seconds (numeric)
campaign : number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays : number of days that passed by after the client was last contacted from a previous campaign (numeric)
previous : number of contacts performed before this campaign and for this client (numeric)
poutcome : outcome of the previous marketing campaign (unknown, other, failure, success)

A small extract from the data set is shown below. It has 45211 records. The last field, ‘y’, records the success or failure of our efforts to get the client to take a term deposit. It is a binary field, and it is this field we are interested in predicting for future clients.

"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"
47;"blue-collar";"married";"unknown";"no";1506;"yes";"no";"unknown";5;"may";92;1;-1;0;"unknown";"no"
33;"unknown";"single";"unknown";"no";1;"no";"no";"unknown";5;"may";198;1;-1;0;"unknown";"no"
35;"management";"married";"tertiary";"no";231;"yes";"no";"unknown";5;"may";139;1;-1;0;"unknown";"no"
28;"management";"single";"tertiary";"no";447;"yes";"yes";"unknown";5;"may";217;1;-1;0;"unknown";"no"

The categorical data was transformed using a series of ‘sed‘ statements (the full shell script is given next). A few lines of the transformed data are shown below. As the words ‘unknown’, ‘yes’ and ‘no’ appear as values in many data fields, for the sake of easy transformation the codes for these values were kept uniform throughout.

"age","job","marital","education","default","balance","housing","loan","contact","day","month","duration","campaign","pdays","previous","poutcome","y"
58,4,1,4,1,2143,2,1,2,5,5,261,1,-1,0,2,1
33,6,1,3,1,2,2,2,2,5,5,76,1,-1,0,2,1
47,8,1,2,1,1506,2,1,2,5,5,92,1,-1,0,2,1
33,2,3,2,1,1,1,1,2,5,5,198,1,-1,0,2,1
35,4,1,4,1,231,2,1,2,5,5,139,1,-1,0,2,1
28,4,3,4,1,447,2,2,2,5,5,217,1,-1,0,2,1

The data transformation bash shell script is as below. It is liberally commented for easy reading.

#!/bin/sh
#
# Your home folder
cd ~
homefolder=`pwd`
# Your data folder & data file
datadir="$homefolder/mahout"
trainfile="bank-full.csv"
cd $datadir
# Delete previous temp file
rm -f temp.txt
######bank-full.csv##########
# Transforming job type first (for example, unknown is replaced by 2 and unemployed by 3)
sed 's/\"admin\.\"/1/g ; s/\"unknown\"/2/g ; s/\"unemployed\"/3/g ; s/\"management\"/4/g ; s/\"housemaid\"/5/g ; s/\"entrepreneur\"/6/g ; s/\"student\"/7/g ;s/\"blue-collar\"/8/g ; s/\"self-employed\"/9/g ;s/\"retired\"/10/g ; s/\"technician\"/11/g ; s/\"services\"/12/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming next marital status
sed 's/\"married\"/1/g ; s/\"divorced\"/2/g ; s/\"single\"/3/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming now education category
sed 's/\"unknown\"/2/g ; s/\"secondary\"/3/g ; s/\"primary\"/1/g ; s/\"tertiary\"/4/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming credit-in-default, has-housing-loan, has-personal-loan
sed 's/\"yes\"/1/g ; s/\"no\"/0/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming communication type
sed 's/\"unknown\"/2/g ; s/\"telephone\"/3/g ; s/\"cellular\"/1/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming month
sed 's/\"jan\"/1/g ; s/\"feb\"/2/g ;s/\"mar\"/3/g ; s/\"apr\"/4/g ;s/\"may\"/5/g ;s/\"jun\"/6/g ; s/\"jul\"/7/g ; s/\"aug\"/8/g ; s/\"sep\"/9/g ; s/\"oct\"/10/g ; s/\"nov\"/11/g ;s/\"dec\"/12/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
#Transforming campaign success
sed 's/\"unknown\"/2/g ; s/\"other\"/3/g ; s/\"failure\"/4/g ;  s/\"success\"/1/g'  $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile

# Lastly, replace the semicolon separator with a comma separator
sed 's/;/,/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
echo -n "Press a key to finish"; read x

The bank data set contains two csv files: ‘bank-full.csv’ with 45211 records and ‘bank.csv’ with around 4500 records. The records in ‘bank.csv’ have been selected at random from bank-full.csv; their purpose is to validate the logit model created from the full set. However, as these records also appear in bank-full.csv, we have written a script that randomly selects a pre-specified number of records from bank-full.csv (without replacement) and creates two files, training.csv and tovalidate.csv, containing disjoint sets of records with no duplication. The file training.csv is used to build the model and tovalidate.csv to test it. The shell script is given below:

#!/bin/bash
#
#
# Generates cross-validation file to test logistic regression model
# The script randomly picks up n-lines (n to be specified) from the
# given file. It creates two files: training.csv and tovalidate.csv
# The original file remains undisturbed.
#

# Your home folder
cd ~
homefolder=`pwd`
# Your data folder & data file

datadir="$homefolder/mahout"
originalfile="bank-full.csv"
samplefile="tovalidate.csv"

cd $datadir
# Delete earlier sample file & recreate it
rm -f $datadir/$samplefile
touch $datadir/$samplefile
# Delete earlier training file and recreate it
rm -f $datadir/training.csv
cp $datadir/$originalfile $datadir/training.csv
# Delete temp file, if it exists
rm -f temp.txt

# Number of lines in given file
nooflines=`sed -n '$=' $datadir/$originalfile`

echo "No of lines in $datadir/training.csv  are: $nooflines"
echo -n "Specify your sample size (recommended 10% of orig file)? " ; read samplesize

# If nothing is specified, the default sample size is 10 records
if [ -z "$samplesize" ] ; then
  echo "Default value of sample size = 10"
  samplesize=10
fi

# Bash loop to generate random numbers
echo "Will generate random numbers between 2 to $samplesize"
echo "Original file size is $nooflines lines"
echo "Wait....."
for (( i = 1 ; i <= $samplesize ; ++i ));
  do
     # $RANDOM alone only reaches 32767, fewer than the lines in bank-full.csv;
     # combine two draws so every data line (2..$nooflines) can be picked.
     arr[i]=$(( ( (RANDOM * 32768 + RANDOM) % (nooflines - 1) ) + 2 ));
     lineno="${arr[i]}"  # Append this line to the sample file
     sed -n "${lineno}p" $datadir/training.csv >> $datadir/$samplefile
     # Delete the same line from training.csv
     sed "${lineno}d" $datadir/training.csv > temp.txt
     mv temp.txt $datadir/training.csv
done
trlines=`sed -n '$=' $datadir/training.csv`
samplines=`sed -n '$=' $datadir/$samplefile`

# Delete temp file
rm -f temp.txt

echo "---------------------------------"
echo "Lines in sample file $samplefile: $samplines"
echo "Lines in training file training.csv : $trlines" ;
echo "Data folder: $datadir"
echo "---------------------------------"

There is no missing data in the full data set. In Part-II we will build the logistic model using RapidMiner Studio.