
Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part III

March 10, 2014

This is Part-III of the series. In Part-I we described the data under analysis and in Part-II we used RapidMiner to draw up a logistic regression model. We now use R for the purpose.

R is a command/program oriented framework and therefore has a steeper learning curve than menu-based statistical tools. Once the initial hurdle is crossed, however, one is rewarded with a very rich set of statistical processing tools backed by plenty of documentation.

There is no need to transform the bank data set as we did in Part-I (i.e. to recode categorical values as integers). We just have to tell R which variables are categorical, and R will take care of them while summarising the data and while building the model. We will proceed in three steps: in Step 1 we import the data and look at its descriptive statistics; in Step 2 we build the model; and in Step 3 we evaluate its performance and use it to make predictions.

# Read semicolon separated data.
# The data is read as a data-frame and assigned to bankdata
# We will partition the data-set in 70:30 ratio
# for training purposes and for validation purposes.
# Import library 'caret'
# Indicate the classification variable
# Split now
# Get training data-frame
# Get data-frame for testing
# Check number of rows in split-data set
# Alternatively, check dimensions of both datasets
# Check whether both data-sets contain column headings
# Indicate now which variables are categorical variables
# Get descriptive stats for data
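The R commands themselves have not survived in this post; the sketch below is a reconstruction consistent with the comments above. The file name bank-full.csv, the seed value and the explicit as.factor() loop are assumptions (in R 4.x, read.csv no longer converts strings to factors automatically):

```r
# Read semicolon separated data; the data-frame is assigned to bankdata
bankdata <- read.csv("bank-full.csv", header = TRUE, sep = ";")

# Indicate which variables are categorical variables
for (v in c("job", "marital", "education", "default", "housing",
            "loan", "contact", "month", "poutcome", "y"))
    bankdata[[v]] <- as.factor(bankdata[[v]])

# Import library 'caret'; we will partition the data-set in 70:30 ratio
library(caret)
set.seed(1234)                      # assumed seed; any fixed value makes the split reproducible
# Split now; bankdata$y is the classification variable
inTrain   <- createDataPartition(bankdata$y, p = 0.70, list = FALSE)
banktrain <- bankdata[inTrain, ]    # training data-frame
banktest  <- bankdata[-inTrain, ]   # data-frame for testing

# Check dimensions and column headings of both datasets
dim(banktrain) ; dim(banktest)
names(banktrain) ; names(banktest)

# Get descriptive stats for data
summary(banktrain)
```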

We have used the caret package (Classification And REgression Training) for data splitting. Package ‘caret’ contains a set of functions to streamline the process of creating predictive models. The function createDataPartition() creates balanced splits of the data: if its class argument is a factor (as in this case), the random sampling occurs within each class, which preserves the overall class distribution of the data. A random seed (set.seed) should be set first so that the split is reproducible. We get a training data-set, banktrain, and a test data-set, banktest. We also indicate above which of the fields are categorical (factor) in nature. The summary output (descriptive statistics) is as follows:

> summary(banktrain)

      age                 job           marital          education
 Min.   :18.00   blue-collar:6786   divorced: 3622   primary  : 4722
 1st Qu.:33.00   management :6613   married :19058   secondary:16298
 Median :39.00   technician :5369   single  : 8969   tertiary : 9329
 Mean   :40.92   admin.     :3602                    unknown  : 1300
 3rd Qu.:48.00   services   :2929
 Max.   :95.00   retired    :1574
                 (Other)    :4776

 default        balance       housing      loan            contact
 no :31076   Min.   : -8019   no :14164   no :26564   cellular :20468
 yes:  573   1st Qu.:    75   yes:17485   yes: 5085   telephone: 2056
             Median :   454                           unknown  : 9125
             Mean   :  1360
             3rd Qu.:  1432
             Max.   :102127

      day            month         duration         campaign
 Min.   : 1.00   may    :9530   Min.   :   0.0   Min.   : 1.000
 1st Qu.: 8.00   jul    :4864   1st Qu.: 103.0   1st Qu.: 1.000
 Median :16.00   aug    :4376   Median : 181.0   Median : 2.000
 Mean   :15.78   jun    :3776   Mean   : 259.9   Mean   : 2.747
 3rd Qu.:21.00   nov    :2776   3rd Qu.: 321.0   3rd Qu.: 3.000
 Max.   :31.00   apr    :2070   Max.   :4918.0   Max.   :55.000

     pdays           previous         poutcome       y
 Min.   : -1.00   Min.   : 0.000   failure: 3410   no :27946
 1st Qu.: -1.00   1st Qu.: 0.000   other  : 1275   yes: 3703
 Median : -1.00   Median : 0.000   success: 1057
 Mean   : 39.82   Mean   : 0.568   unknown:25907
 3rd Qu.: -1.00   3rd Qu.: 0.000
 Max.   :854.00   Max.   :58.000

Note from the above that age varies from 18 to 95, but a third quartile of 48 years implies that younger persons outnumber older persons. Also, the (Other) category of job groups the less frequent categories: ‘student’, ‘self-employed’, ‘unemployed’, ‘housemaid’, ‘entrepreneur’ and ‘unknown’. Any of these may be important, as we will shortly see. As to credit in default, only about 2% are defaulters; since 98% are non-defaulters, this field may be of little use in making predictions. Also note the statistics for the class variable ‘y’: around 28000 clients have not subscribed to a term deposit and only 3703 have. The model will have to work hard to find a pattern. This said, let us now proceed with building the model. We will use Generalised Linear Modelling. In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows response variables with error distributions other than the normal distribution. The GLM relates the linear model to the response variable via a link function; the link function in our case is the logit function.

# Use glm function, with y as dependent variable and all others as independent variables (IVs), written
#  in the usual manner with '+' sign.
logitModel=glm(y~poutcome+previous+pdays+campaign+duration+month+day+contact+loan+housing+balance+default+education+marital+job+age, data=banktrain,family=binomial(logit))
# As all variables besides, y, are IVs, we could have used a shorter form (with dot), as:
logitModel=glm(y~., data=banktrain,family=binomial(logit))
# Get model summary

The model summary is shown next.

> summary(logitModel)

glm(formula = y ~ ., family = binomial(logit), data = banktrain)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-5.6771  -0.3754  -0.2549  -0.1505   3.4325

                     Estimate Std. Error z value Pr(>|z|)
(Intercept)        -2.665e+00  2.198e-01 -12.122  < 2e-16 ***
age                 2.862e-04  2.633e-03   0.109  0.91346
jobblue-collar     -2.465e-01  8.752e-02  -2.817  0.00485 **
jobentrepreneur    -2.497e-01  1.471e-01  -1.698  0.08952 .
jobhousemaid       -3.159e-01  1.582e-01  -1.997  0.04582 *
jobmanagement      -7.378e-02  8.841e-02  -0.835  0.40397
jobretired          2.571e-01  1.167e-01   2.203  0.02759 *
jobself-employed   -2.568e-01  1.343e-01  -1.912  0.05586 .
jobservices        -1.698e-01  1.007e-01  -1.686  0.09180 .
jobstudent          3.429e-01  1.320e-01   2.597  0.00939 **
jobtechnician      -1.166e-01  8.358e-02  -1.395  0.16315
jobunemployed      -4.978e-02  1.333e-01  -0.373  0.70885
jobunknown         -1.998e-01  2.718e-01  -0.735  0.46233
maritalmarried     -1.959e-01  7.002e-02  -2.797  0.00516 **
maritalsingle       8.012e-02  8.019e-02   0.999  0.31779
educationsecondary  1.591e-01  7.670e-02   2.074  0.03809 *
educationtertiary   3.350e-01  8.975e-02   3.733  0.00019 ***
educationunknown    2.744e-01  1.226e-01   2.238  0.02524 *
defaultyes          7.072e-03  1.878e-01   0.038  0.96996
balance             1.348e-05  6.174e-06   2.183  0.02903 *
housingyes         -6.644e-01  5.238e-02 -12.684  < 2e-16 ***
loanyes            -4.697e-01  7.231e-02  -6.495 8.30e-11 ***
contacttelephone   -8.801e-02  8.794e-02  -1.001  0.31689
contactunknown     -1.637e+00  8.774e-02 -18.655  < 2e-16 ***
day                 8.951e-03  2.977e-03   3.006  0.00265 **
monthaug           -6.874e-01  9.428e-02  -7.291 3.09e-13 ***
monthdec            7.854e-01  2.005e-01   3.917 8.96e-05 ***
monthfeb           -8.528e-02  1.060e-01  -0.805  0.42094
monthjan           -1.354e+00  1.526e-01  -8.875  < 2e-16 ***
monthjul           -7.605e-01  9.232e-02  -8.237  < 2e-16 ***
monthjun            4.912e-01  1.122e-01   4.379 1.19e-05 ***
monthmar            1.765e+00  1.416e-01  12.468  < 2e-16 ***
monthmay           -3.817e-01  8.683e-02  -4.396 1.10e-05 ***
monthnov           -7.817e-01  1.002e-01  -7.799 6.25e-15 ***
monthoct            9.693e-01  1.286e-01   7.536 4.83e-14 ***
monthsep            8.515e-01  1.422e-01   5.988 2.12e-09 ***
duration            4.104e-03  7.584e-05  54.122  < 2e-16 ***
campaign           -9.137e-02  1.219e-02  -7.494 6.68e-14 ***
pdays               2.919e-04  3.644e-04   0.801  0.42300
previous            3.424e-02  1.159e-02   2.954  0.00313 **
poutcomeother       8.240e-03  1.093e-01   0.075  0.93991
poutcomesuccess     2.228e+00  9.802e-02  22.729  < 2e-16 ***
poutcomeunknown     2.050e-03  1.145e-01   0.018  0.98572 

--- Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
(Dispersion parameter for binomial family taken to be 1)

 Null deviance: 22845  on 31648  degrees of freedom
 Residual deviance: 15129  on 31606  degrees of freedom

 AIC: 15215
 Number of Fisher Scoring iterations: 6


Coefficients of the independent variables in the linear equation are given above. Wald’s z-test is performed to bring out which of the coefficients are significantly different from zero: those flagged with a star (or, weakly, a dot) at the end of the row are, the others are not. Thus ‘pdays’ and ‘age’ can be safely removed from the model. Further, note that of the three poutcome dummies, two, poutcomeunknown and poutcomeother, are not significant; only poutcomesuccess is. Is poutcome significant overall? This can be gauged using the Wald test available in the ‘aod’ package.

# Conduct wald's test on poutcome at serial number 41 to 43. Coeff (Terms), 41
#  to 43 in the coeff table are jointly tested
> library(aod)
> wald.test(b = coef(logitModel), Sigma = vcov(logitModel), Terms = 41:43)
Wald test:

Chi-squared test:
X2 = 731.2, df = 3, P(> X2) = 0.0

With a p-value of 0.0, poutcome is jointly significant. A similar test on maritalmarried (significant, term 14) and maritalsingle (not significant, term 15) yields the result that jointly they are significant. Should you construct another model after removing non-significant variables, compare the AIC values. The Akaike information criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data, and as such provides a means for model selection; the absolute value of AIC on its own is of no use. You can use the stepAIC command available in the MASS library to automatically carry out a restricted search for the “best” model, as measured by AIC. stepAIC accepts both categorical and numerical variables. It does not search all possible subsets of variables; rather, it searches in a specific sequence determined by the order of the variables in the formula of the model object.
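Since the bank data is not bundled with this post, here is a self-contained illustration of stepAIC on R’s built-in mtcars data; for the bank model you would call stepAIC on logitModel instead. The formula and direction = "backward" are illustrative choices:

```r
library(MASS)   # stepAIC lives here

# A small logistic model on built-in data, purely to show the mechanics
full <- glm(am ~ wt + hp + disp, data = mtcars, family = binomial(logit))

# AIC-driven backward elimination; trace = FALSE suppresses the step-by-step log
best <- stepAIC(full, direction = "backward", trace = FALSE)
summary(best)

# stepAIC never moves to a model with a higher AIC than the starting one
AIC(best) <= AIC(full)
```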

Deviance is an important measure of fit of a model and is also used to compare models. The deviance of a model is minus 2 times the log-likelihood of the data under the model, plus a constant that is the same for all models on the same data and so can be ignored. The larger the deviance, the worse the fit. The deviance of the null model (i.e. the model with only an intercept term) can also be compared with that of the model with independent variables, to find out whether the fit is significantly better: the difference in deviances follows a chi-squared distribution. The residual deviance of 15129 on 31606 degrees of freedom is well below its degrees of freedom; the lower-tail probability pchisq(15129, df=31606), shown below, is 0, indicating no evidence of lack of fit.

> pchisq(15129,df=31606)
[1] 0
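The null-versus-fitted comparison mentioned above can be done with nothing more than the numbers already in the model summary: the drop in deviance is tested against a chi-squared distribution with degrees of freedom equal to the number of parameters added.

```r
# From the summary above: null deviance 22845 on 31648 df,
# residual deviance 15129 on 31606 df
dev_drop <- 22845 - 15129          # deviance explained by the predictors
df_drop  <- 31648 - 31606          # 42 parameters added
pchisq(dev_drop, df = df_drop, lower.tail = FALSE)   # effectively 0: a significant improvement
```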

The model, that we constructed, has the form:

log(p/(1-p)) = intercept + c1*x1 + c2*x2 + c3*x3 + …

The ratio p/(1-p) is the odds of success. To obtain the odds ratio associated with each predictor, we exponentiate its coefficient.

# Find odds ratio as:

# Or better still with 95% confidence interval, as:
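The commands for this step are missing from the post. Judging by the column headings of the table that follows, they would have been along the following lines (a reconstruction, not the author’s verbatim code; confint() on a glm computes profile-likelihood intervals, so the second call can take a while on some 31000 rows):

```r
# Find odds ratios by exponentiating the coefficients:
exp(coef(logitModel))

# Or better still, with a 95% confidence interval:
library(MASS)   # supplies the profile-likelihood confint method for glm objects
exp(cbind(OddsRatio = coef(logitModel), confint(logitModel)))
```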

The result is as below:

                    OddsRatio      2.5 %     97.5 %
(Intercept)        0.06961723 0.04519572  0.1069932
age                1.00028622 0.99513022  1.0054570
jobblue-collar     0.78152131 0.65841888  0.9279574
jobentrepreneur    0.77904585 0.58115945  1.0347133
jobhousemaid       0.72911497 0.53184047  0.9892007
jobmanagement      0.92887269 0.78145798  1.1051934
jobretired         1.29321532 1.02837065  1.6251009
jobself-employed   0.77351038 0.59275182  1.0037319
jobservices        0.84381316 0.69202032  1.0271966
jobstudent         1.40906871 1.08643664  1.8231978
jobtechnician      0.88998429 0.75572355  1.0487571
jobunemployed      0.95143826 0.73071041  1.2325361
jobunknown         0.81890624 0.47112825  1.3708220
maritalmarried     0.82212563 0.71735677  0.9439807
maritalsingle      1.08341237 0.92644281  1.2687139
educationsecondary 1.17242630 1.00961551  1.3638226
educationtertiary  1.39792771 1.17313231  1.6678428
educationunknown   1.31578091 1.03286746  1.6707020
defaultyes         1.00709671 0.68672796  1.4360142
balance            1.00001348 1.00000121  1.0000255
housingyes         0.51455711 0.46425005  0.5700844
loanyes            0.62520767 0.54174148  0.7193315
contacttelephone   0.91574909 0.76936952  1.0861102
contactunknown     0.19461557 0.16368937  0.2308932
day                1.00899075 1.00311967  1.0148967
monthaug           0.50289086 0.41809127  0.6050744
monthdec           2.19320466 1.47725368  3.2441840
monthfeb           0.91825609 0.74581870  1.1299667
monthjan           0.25821116 0.19045604  0.3465018
monthjul           0.46745301 0.39008422  0.5602142
monthjun           1.63432029 1.31182006  2.0364832
monthmar           5.84128618 4.42688513  7.7123317
monthmay           0.68267604 0.57604036  0.8096684
monthnov           0.45760929 0.37573926  0.5566459
monthoct           2.63614726 2.04797977  3.3911321
monthsep           2.34312269 1.77218907  3.0948956
duration           1.00411286 1.00396456  1.0042631
campaign           0.91268310 0.89074341  0.9343410
pdays              1.00029198 0.99957596  1.0010053
previous           1.03482952 1.01096152  1.0581795
poutcomeother      1.00827444 0.81254711  1.2474276
poutcomesuccess    9.27986285 7.66483355 11.2562505
poutcomeunknown    1.00205188 0.80113330  1.2553803

It indicates, for example, that the odds that a retired person takes a term deposit are about 1.29 times the odds for the baseline job category (‘admin.’, the omitted factor level). By the same token, an odds ratio of less than 1 indicates decreased odds of success. On this see the Stata FAQ here.
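The odds-ratio table is simply the exponential of the coefficient table; for instance, for jobretired:

```r
b_retired <- 2.571e-01   # coefficient of jobretired from the model summary
exp(b_retired)           # ~1.293, the factor quoted above
```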

Recall that we had split the bank dataset into banktrain and banktest in a 70:30 ratio. For constructing the model, i.e. for training, we used the banktrain dataset. We can now test how good our model is by using the banktest dataset and drawing up a confusion matrix.

# Calculate probabilities
# We use only columns 1 to 16 of banktest dataset as column 17
# contains the class target, y, to be predicted
# First probabilities of each event
probabilities <- predict.glm(logitModel, banktest[,1:16], type="response")
# Convert probabilities into class values. 
# 0.5 is the threshold value.
predictions <- cut(probabilities, c(-Inf,0.5,Inf), labels=c("no","yes"))
# Draw up confusion matrix
table(predictions, banktest$y)

The confusion matrix is given below:

predictions    no   yes   
        no  11702  1068    
        yes   274   518
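The per-class figures quoted next can be checked directly from this matrix:

```r
# Rows are predictions, columns are actual classes (values from the table above)
cm <- matrix(c(11702, 274, 1068, 518), nrow = 2,
             dimnames = list(predicted = c("no", "yes"), actual = c("no", "yes")))

precision_no  <- cm["no", "no"]  / sum(cm["no", ])     # 11702 / 12770
precision_yes <- cm["yes", "yes"] / sum(cm["yes", ])   # 518 / 792
overall       <- sum(diag(cm)) / sum(cm)               # overall accuracy
round(100 * c(precision_no, precision_yes, overall), 1)  # -> 91.6 65.4 90.1
```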

There is, therefore, a prediction precision of 91.6% for ‘no’ (11702 of the 12770 cases predicted ‘no’ are actually ‘no’) and 65.4% for ‘yes’ (518 of the 792 cases predicted ‘yes’ are actually ‘yes’). We can get a similar overall accuracy figure using the CVbinary function of the DAAG package, as below:

> library(DAAG)
> CVbinary(logitModel)

Fold:  5 9 1 6 8 7 3 10 2 4
Internal estimate of accuracy = 0.902
Cross-validation estimate of accuracy = 0.902

You can also examine collinearity among variables and eliminate highly correlated variables as below:
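The code for this step did not survive in the post. A minimal sketch, assuming the banktrain data-frame from earlier (so it will not run without the bank data): inspect pairwise correlations among the numeric predictors, or variance inflation factors if the car package is available.

```r
# Correlation matrix of the numeric columns of banktrain
numeric_cols <- sapply(banktrain, is.numeric)
round(cor(banktrain[, numeric_cols]), 2)

# Alternatively (requires the car package to be installed):
# library(car)
# vif(logitModel)
```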


Some good references for logistic regression using R are here: [1] and [2]

In Part-IV we will use mahout from command line to build logit model.

Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part I

March 5, 2014

Logistic Regression is a multivariate statistical technique often used in predictive analysis where the dependent variable to be predicted is dichotomous and depends upon a number of independent variables, which may be continuous or categorical. Logistic regression therefore differs from normal (linear) regression analysis: in linear regression the dependent variable must be continuous, whereas in logistic regression it must be categorical. In this respect logistic regression and discriminant analysis deal with the same kind of problems and have the same objective; however, the assumptions of discriminant analysis are more stringent than those of logistic regression.
Logistic Regression is very popular in epidemiological studies and in marketing. It answers such questions as: given certain body-metrics, how likely is it that a patient will have or be susceptible to a certain disease? It is also widely used in computing the TRISS (Trauma and Injury Severity Score) to predict mortality among severely injured patients. In marketing, it can answer such questions as which steps are more likely to retain customers, how customers’ propensity to purchase could be enhanced, or which customers are likely to default on home mortgage loans. In this four-part series, we will use three open-source tools, RapidMiner Studio, R and Mahout (command line), to perform logistic regression on a set of marketing data.
Problem and the data: We will make use of the Bank Marketing Data Set available at the UCI Machine Learning Repository. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls; often, more than one contact to the same client was required in order to assess whether the product (a bank term deposit) would be subscribed or not. The data has been analysed by S. Moro, R. Laureano and P. Cortez using three techniques: Naïve Bayes, Decision Trees and Support Vector Machines (SVM). A logistic regression model, however, was not constructed. We will do that now. But first the data. The data description is as below:


Data heading : Description
age : numeric
job : type of job (categorical: admin., unknown, unemployed, management, housemaid, entrepreneur, student, blue-collar, self-employed, retired, technician, services)
marital : marital status (married, divorced, single)
education : unknown, secondary, primary, tertiary
default : yes, no
balance : numeric
housing : yes, no
loan : has personal loan (yes, no)
contact : contact communication (unknown, telephone, cellular)
day : last contact day of the month (numeric)
month : last contact month of year (jan, feb, mar, …, nov, dec)
duration : last contact duration, in seconds (numeric)
campaign : number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays : number of days that passed by after the client was last contacted from a previous campaign (numeric)
previous : number of contacts performed before this campaign and for this client (numeric)
poutcome : outcome of the previous marketing campaign (unknown, other, failure, success)

The data-set has 45211 records. The last field, ‘y’, records the success or failure of the campaign efforts with regard to the client taking a term deposit. It is a binary field, and it is this field we are interested in predicting for future clients.


The categorical data was transformed using a series of ‘sed’ statements (the full shell script is given next). As the words ‘unknown’, ‘yes’ and ‘no’ appear as values in many data fields, for the sake of easy transformation the codes for these values were kept uniform throughout.


The data transformation bash shell script is as below. It is liberally commented for easy reading.

# Your home folder
cd ~
# Your data folder & data file
cd $datadir
# Delete previous temp file
rm -f temp.txt
# Transforming first the job type (for example, "unknown" is replaced by 2 and "unemployed" by 3)
sed 's/\"admin\.\"/1/g ; s/\"unknown\"/2/g ; s/\"unemployed\"/3/g ; s/\"management\"/4/g ; s/\"housemaid\"/5/g ; s/\"entrepreneur\"/6/g ; s/\"student\"/7/g ;s/\"blue-collar\"/8/g ; s/\"self-employed\"/9/g ;s/\"retired\"/10/g ; s/\"technician\"/11/g ; s/\"services\"/12/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming next marital status
sed 's/\"married\"/1/g ; s/\"divorced\"/2/g ; s/\"single\"/3/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming now education category
sed 's/\"unknown\"/2/g ; s/\"secondary\"/3/g ; s/\"primary\"/1/g ; s/\"tertiary\"/4/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming credit-in-default, has-housing-loan, has-personal-loan
sed 's/\"yes\"/1/g ; s/\"no\"/0/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming communication type
sed 's/\"unknown\"/2/g ; s/\"telephone\"/3/g ; s/\"cellular\"/1/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming month
sed 's/\"jan\"/1/g ; s/\"feb\"/2/g ;s/\"mar\"/3/g ; s/\"apr\"/4/g ;s/\"may\"/5/g ;s/\"jun\"/6/g ; s/\"jul\"/7/g ; s/\"aug\"/8/g ; s/\"sep\"/9/g ; s/\"oct\"/10/g ; s/\"nov\"/11/g ;s/\"dec\"/12/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming the outcome of the previous campaign (poutcome)
sed 's/\"unknown\"/2/g ; s/\"other\"/3/g ; s/\"failure\"/4/g ;  s/\"success\"/1/g'  $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile

# Lastly, replace the semicolon separator with a comma separator
sed 's/;/,/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
echo -n "Press a key to finish"; read x
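To see the substitutions at work, here is a single made-up record pushed through two of the rules; note how "unknown" is recoded identically in whichever field it appears, which is why the codes were kept uniform.

```shell
# One sample record in the raw semicolon-separated format (values invented for illustration)
line='58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"'

# Apply the "management" and "unknown" substitutions from the script above
echo "$line" | sed 's/\"management\"/4/g ; s/\"unknown\"/2/g'
# -> 58;4;"married";"tertiary";"no";2143;"yes";"no";2;5;"may";261;1;-1;0;2;"no"
```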

The bank data set contains two csv files: ‘bank-full.csv’, containing 45211 records, and ‘bank.csv’, containing around 4500 records selected at random from bank-full.csv. The purpose of the records in bank.csv is to validate the logit model created from the full set. However, as these records also appear in bank-full.csv, we have written a script to randomly select a pre-specified number of records from bank-full.csv (without replacement) and create two files, training.csv and tovalidate.csv, which contain different records without any duplication. The file training.csv is used to build the model and tovalidate.csv to test it. The shell script is given below:

# Generates cross-validation file to test logistic regression model
# The script randomly picks up n-lines (n to be specified) from the
# given file. It creates two files: training.csv and tovalidate.csv
# The original file remains undisturbed.

# Your home folder
cd ~
# Your data folder & data file


cd $datadir
# Delete earlier sample file & recreate it
rm -f $datadir/$samplefile
touch $datadir/$samplefile
# Delete earlier training file and recreate it
rm -f $datadir/training.csv
cp $datadir/$originalfile $datadir/training.csv
# Delete temp file, if it exists
rm -f temp.txt

# Number of lines in given file
nooflines=`sed -n '$=' $datadir/$originalfile`

echo "No of lines in $datadir/training.csv  are: $nooflines"
echo -n "Specify your sample size (recommended 10% of orig file)? " ; read samplesize

# If nothing is specified, default is 50 records
if [ -z "$samplesize" ] ; then
    samplesize=50
    echo "Default value of sample size = $samplesize"
fi

# Bash loop to generate random numbers
echo "Will generate random line numbers between 2 and $nooflines"
echo "Original file size is $nooflines lines"
echo "Wait....."
for (( i = 1 ; i <= samplesize ; ++i )) ; do
     # Recompute the line count, as training.csv shrinks by one line per pass
     nooflines=`sed -n '$=' $datadir/training.csv`
     # $RANDOM only reaches 32767, so combine two draws to cover all 45211 lines;
     # line 1 (the header) is never picked
     lineno=$(( ( (RANDOM * 32768 + RANDOM) % (nooflines - 1) ) + 2 ))
     # Append the sampled line to the sample file
     sed -n "${lineno}p" $datadir/training.csv >> $datadir/$samplefile
     # Delete the same line from training.csv
     sed "${lineno}d" $datadir/training.csv > temp.txt
     mv temp.txt $datadir/training.csv
done
trlines=`sed -n '$=' $datadir/training.csv`
samplines=`sed -n '$=' $datadir/$samplefile`

# Delete temp file
rm -f temp.txt

echo "---------------------------------"
echo "Lines in sample file $samplefile: $samplines"
echo "Lines in training file training.csv : $trlines" ;
echo "Data folder: $datadir"
echo "---------------------------------"
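As an aside: on systems with GNU coreutils, the same no-duplication split can be done without a line-by-line loop, by shuffling the records once with shuf and cutting the deck. A sketch on a ten-record toy file (toy.csv and the 3/7 split are invented for illustration; re-attach the header line afterwards if your tools expect one):

```shell
# Toy data: a header line plus ten records
printf 'header\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n' > toy.csv

# Shuffle the records (not the header), then split with no overlap
tail -n +2 toy.csv | shuf > shuffled.txt
head -n 3 shuffled.txt > tovalidate.csv    # 3 randomly chosen records
tail -n +4 shuffled.txt > training.csv     # the remaining 7 records
wc -l tovalidate.csv training.csv
```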

There is no missing data in the full data set. In Part-II we will build logistic model using RapidMiner Studio.