Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part I

Logistic regression is a multivariate statistical technique often used in predictive analysis when the dependent variable to be predicted is dichotomous and depends on a number of independent variables. The independent variables may be continuous or categorical. Logistic regression therefore differs from ordinary (linear) regression: in linear regression the dependent variable must be continuous, whereas in logistic regression it must be categorical. In this respect logistic regression and discriminant analysis deal with the same kind of problem and have the same objective. However, the assumptions of discriminant analysis are more stringent than those of logistic regression.
Logistic regression is very popular in epidemiological studies and in marketing. It answers questions such as: given certain body metrics, how likely is it that a patient has, or is susceptible to, a certain disease? It is also widely used in computing TRISS (Trauma and Injury Severity Score) to predict mortality among severely injured patients. In marketing, it can answer questions such as which steps are more likely to retain customers, how customers' propensity to purchase could be enhanced, or which customers are likely to default on home mortgage loans. In this four-part series, we will use three open-source tools, RapidMiner Studio, R and Mahout (command line), to perform logistic regression on a set of marketing data.
Problem and the data: We will use the Bank Marketing Data Set available at the UCI Machine Learning Repository. The data relates to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether or not the product (a bank term deposit) would be subscribed. The data has been analysed by S. Moro, R. Laureano and P. Cortez using three techniques: Naïve Bayes, Decision Trees and Support Vector Machines (SVM). A logistic regression model, however, was not constructed. We will do that now. But first the data. The data description is as below:
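For readers who like to see the model written out, the standard form of logistic regression is sketched below. The notation is an assumption made for this illustration and is not used elsewhere in this series: p is the probability that the dichotomous dependent variable takes the value 1, x_1 to x_k are the independent variables, and the betas are the coefficients to be estimated.

```latex
\[
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
\]
```

The left-hand form shows that the log-odds are linear in the independent variables; the right-hand form shows that the predicted probability always lies between 0 and 1.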


Data heading   Description
age            numeric
job            type of job (categorical: admin., unknown, unemployed, management, housemaid, entrepreneur, student, blue-collar, self-employed, retired, technician, services)
marital        marital status: married, divorced, single
education      unknown, secondary, primary, tertiary
default        has credit in default: yes, no
balance        numeric
housing        has housing loan: yes, no
loan           has personal loan: yes, no
contact        contact communication type: unknown, telephone, cellular
day            last contact day of the month (numeric)
month          last contact month of year: jan, feb, mar, …, nov, dec
duration       last contact duration, in seconds (numeric)
campaign       number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays          number of days that passed after the client was last contacted in a previous campaign (numeric)
previous       number of contacts performed before this campaign and for this client (numeric)
poutcome       outcome of the previous marketing campaign: unknown, other, failure, success

A small extract from the data set is shown below. It has 45211 records. The last field, ‘y’, records the success or failure of the effort to get the client to take a term deposit. It is a binary field, and it is this field that we want to predict for future clients.


The categorical data was transformed using a series of ‘sed’ statements (the full shell script is given next). A few lines of the transformed data are shown below. As the words ‘unknown’, ‘yes’ and ‘no’ appear as values in many data fields, the codes for these values were kept uniform throughout for ease of transformation.
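To make the mapping concrete, here is a minimal sketch that encodes a single record. The record is illustrative only and assumes the semicolon-separated, double-quoted layout that the sed patterns in the script expect; encode_record is a hypothetical helper name, and it lists only the substitutions this one record needs.

```shell
#!/usr/bin/env bash
# Illustrative record (assumed layout: semicolon-separated, quoted categories)
record='58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"'

# encode_record is a hypothetical helper: it applies the same codes as the
# full script, but only for the values that occur in the record above.
encode_record() {
    sed -e 's/"management"/4/g' \
        -e 's/"married"/1/g' \
        -e 's/"tertiary"/4/g' \
        -e 's/"yes"/1/g' \
        -e 's/"no"/0/g' \
        -e 's/"unknown"/2/g' \
        -e 's/"may"/5/g' \
        -e 's/;/,/g'
}

encoded=$(printf '%s\n' "$record" | encode_record)
echo "$encoded"
# prints 58,4,1,4,0,2143,1,0,2,5,5,261,1,-1,0,2,0
```

Note that ‘unknown’ becomes 2 in both the contact field and the poutcome field, which is what keeping the codes uniform means in practice.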


The data transformation bash shell script is as below. It is liberally commented for easy reading.

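# Your home folder
cd ~
# Your data folder & data file: stop early if they have not been set
: "${datadir:?set datadir to your data folder}"
: "${trainfile:?set trainfile to your data file}"
cd "$datadir"
# Delete previous temp file
rm -f temp.txt
# Transforming first job type (for example unknown is replaced by 2 and unemployed by 3)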
sed 's/\"admin\.\"/1/g ; s/\"unknown\"/2/g ; s/\"unemployed\"/3/g ; s/\"management\"/4/g ; s/\"housemaid\"/5/g ; s/\"entrepreneur\"/6/g ; s/\"student\"/7/g ;s/\"blue-collar\"/8/g ; s/\"self-employed\"/9/g ;s/\"retired\"/10/g ; s/\"technician\"/11/g ; s/\"services\"/12/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming next marital status
sed 's/\"married\"/1/g ; s/\"divorced\"/2/g ; s/\"single\"/3/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming now education category
sed 's/\"unknown\"/2/g ; s/\"secondary\"/3/g ; s/\"primary\"/1/g ; s/\"tertiary\"/4/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming credit-in-default, has-housing-loan, has-personal-loan
sed 's/\"yes\"/1/g ; s/\"no\"/0/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming communication type
sed 's/\"unknown\"/2/g ; s/\"telephone\"/3/g ; s/\"cellular\"/1/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming month
sed 's/\"jan\"/1/g ; s/\"feb\"/2/g ;s/\"mar\"/3/g ; s/\"apr\"/4/g ;s/\"may\"/5/g ;s/\"jun\"/6/g ; s/\"jul\"/7/g ; s/\"aug\"/8/g ; s/\"sep\"/9/g ; s/\"oct\"/10/g ; s/\"nov\"/11/g ;s/\"dec\"/12/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
#Transforming campaign success
sed 's/\"unknown\"/2/g ; s/\"other\"/3/g ; s/\"failure\"/4/g ;  s/\"success\"/1/g'  $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile

# Lastly replace semicolon separator by comma separator
sed 's/;/,/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
echo -n "Press a key to finish"; read x
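Because the script relies on every categorical value being listed in one of the sed statements, a quick check after running it can be useful. The sketch below assumes that line 1 of the file is a header row and that a correctly transformed data row contains no double quotes; count_unencoded is a hypothetical helper name, not part of the script above.

```shell
#!/usr/bin/env bash
# Count data rows (everything after line 1) that still contain a double quote,
# i.e. rows holding a categorical value that no sed statement replaced.
# count_unencoded is a hypothetical helper name.
count_unencoded() {
    local file=$1
    # grep -c exits non-zero when the count is 0, so that status is ignored
    tail -n +2 "$file" | grep -c '"' || true
}
```

Calling count_unencoded "$datadir/$trainfile" should print 0 if every value was mapped; any other number means some rows still carry text values.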

The bank data set contains two csv files: ‘bank-full.csv’, containing 45211 records, and ‘bank.csv’, containing around 4500 records. The records in ‘bank.csv’ have been selected at random from bank-full.csv. The purpose of bank.csv is to validate the logit model created from the full set. However, as these records also appear in bank-full.csv, we have written a script that randomly selects a pre-specified number of records from bank-full.csv (without replacement) and creates two files: training.csv and tovalidate.csv. The two files contain different records, with no duplication between them. The file training.csv is used to build the model and the file tovalidate.csv to test it. The shell script is given below:

# Generates a cross-validation file to test the logistic regression model.
# The script randomly picks n lines (n to be specified) from the
# given file. It creates two files: training.csv and tovalidate.csv
# The original file remains undisturbed.

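# Your home folder
cd ~
# Your data folder & data files: stop early if they have not been set
: "${datadir:?set datadir to your data folder}"
: "${originalfile:?set originalfile to the full data file}"
: "${samplefile:?set samplefile to the name of the sample file}"
cd "$datadir"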
# Delete earlier sample file & recreate it
rm -f $datadir/$samplefile
touch $datadir/$samplefile
# Delete earlier training file and recreate it
rm -f $datadir/training.csv
cp $datadir/$originalfile $datadir/training.csv
# Delete temp file, if it exists
rm -f temp.txt

# Number of lines in given file
nooflines=`sed -n '$=' $datadir/$originalfile`

echo "No of lines in $datadir/training.csv  are: $nooflines"
echo -n "Specify your sample size (recommended 10% of orig file)? " ; read samplesize

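# If nothing is specified, fall back to a default sample size
if [ -z "$samplesize" ] ; then
    samplesize=10
    echo "Default value of sample size = 10"
fi

# Bash loop to generate random line numbers
echo "Will pick $samplesize random line numbers, starting from line 2"
echo "Original file size is $nooflines lines"
echo "Wait....."
# Lines currently left in training.csv
remaining=$nooflines
for (( i = 1 ; i <= samplesize ; ++i )); do
     # RANDOM alone stops at 32767, so two draws are combined to reach
     # every line of a larger file. Line 1 is skipped: pick from 2..remaining
     lineno=$(( ( (RANDOM * 32768 + RANDOM) % (remaining - 1) ) + 2 ))
     # Append the line to the sample file
     sed -n "${lineno}p" "$datadir/training.csv" >> "$datadir/$samplefile"
     # Delete the same line from training.csv
     sed "${lineno}d" "$datadir/training.csv" > temp.txt
     mv temp.txt "$datadir/training.csv"
     remaining=$(( remaining - 1 ))
done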
trlines=`sed -n '$=' $datadir/training.csv`
samplines=`sed -n '$=' $datadir/$samplefile`

# Delete temp file
rm -f temp.txt

echo "---------------------------------"
echo "Lines in sample file $samplefile: $samplines"
echo "Lines in training file training.csv : $trlines" ;
echo "Data folder: $datadir"
echo "---------------------------------"
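The script above rewrites training.csv once for every sampled line, which can be slow on a file of this size. As an alternative, here is a minimal sketch of the same split done in one pass with shuf. It is not the method used above: it assumes GNU coreutils is installed and that line 1 of the input is a header row, and split_csv and its argument names are hypothetical.

```shell
#!/usr/bin/env bash
# Alternative sketch: split a CSV with a header row into a training file and
# a validation file that share no records, using GNU shuf.
# split_csv and its argument names are hypothetical.
split_csv() {
    local infile=$1 samplesize=$2 trainout=$3 validout=$4
    local shuffled
    shuffled=$(mktemp)

    # Shuffle every data row once (line 1, the header, is left out)
    tail -n +2 "$infile" | shuf > "$shuffled"

    # The first samplesize shuffled rows become the validation set
    head -n "$samplesize" "$shuffled" > "$validout"

    # The header plus all remaining rows become the training set
    head -n 1 "$infile" > "$trainout"
    tail -n +"$(( samplesize + 1 ))" "$shuffled" >> "$trainout"

    rm -f "$shuffled"
}
```

A call such as split_csv bank-full.csv 4500 training.csv tovalidate.csv would then leave the original file untouched. As in the script above, the validation file is written without a header row.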

There is no missing data in the full data set. In Part II we will build the logistic model using RapidMiner Studio.

