Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part II

In Part-I of this series, we described the data set for logistic modelling. In this post, we first set up RapidMiner Studio for the experiment and then discuss the results. RapidMiner Studio 6 can be downloaded from here. The Starter edition and the (open-source) community version (RapidMiner 5.3.015) are free; the free versions' limitation is that the complete data set must fit in memory for analysis. Being Java-based, RapidMiner runs on both Windows and Linux, and there are no installation steps: declare JAVA_HOME, download and unzip the package, and it is ready for work.
Inside the scripts folder, look for the file RapidMinerGUI. There are two of them; select the one with the .sh or .bat extension as appropriate for your OS and double-click it to start RapidMiner. In what follows, a little familiarity with RapidMiner operators will be desirable.

Using the bash script mentioned in Part-I, the file bank-full.csv was split into two files: around 4000 randomly selected records were stored in the tovalidate.csv file, and the training.csv file was left with the remaining (around 41000) records.
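
For reference, here is a minimal R sketch of an equivalent split (the original split was done with the bash script from Part-I; the seed is arbitrary, and the ';' separator is an assumption about the raw UCI bank file):

```r
# Split bank-full.csv into ~4000 validation records and the rest for training.
set.seed(42)                                     # illustrative seed
full <- read.csv("bank-full.csv", sep = ";")     # assumed ';'-separated raw file
idx  <- sample(nrow(full), 4000)                 # ~4000 random records
write.csv(full[idx, ],  "tovalidate.csv", row.names = FALSE)
write.csv(full[-idx, ], "training.csv",   row.names = FALSE)
```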

Setting up RapidMiner

In RapidMiner we will:
a. Import the files training.csv and tovalidate.csv.
b. Given the training data (training.csv), build a logistic regression model (Operator: Logistic Regression).
c. Split the training data in a 70:30 ratio: 70% of the records go into building the model, and 30% are used to gauge the model's performance (Operators: Split Validation; Performance; Apply Model). See the R sketch after this list.
d. At the same time, use this model to classify hitherto unclassified data, i.e. the data in the tovalidate.csv file (Operator: Apply Model).
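
For orientation only, here is a hedged R analogue of steps b and c (the actual work in this post is done in the RapidMiner GUI; the column names follow the bank data set, and the 70:30 split mirrors the Split Validation operator):

```r
# Build and hold-out-test a logistic model, mirroring Split Validation's 70:30 split.
train   <- read.csv("training.csv")
train$y <- factor(train$y)                                  # 'y' as the binominal label
idx     <- sample(nrow(train), floor(0.7 * nrow(train)))    # 70% for model building
fit     <- glm(y ~ ., data = train[idx, ], family = binomial)   # step b: the model
probs   <- predict(fit, train[-idx, ], type = "response")       # step c: held-out scores
```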

Import the data file training.csv into the RapidMiner repository as shown in the figures below.


Figure-I: In the lower-left Repository panel of RapidMiner, click on the down-arrow to begin the import of a CSV file

While importing the data, change the 'attribute' role of the 'y' data-field to 'label'. Field 'y' is the classification field; it records whether or not our efforts to get the desired term deposit from a client were successful. It is the target field of interest: our aim will be to predict which future customers are likely to 'convert' and make a term deposit with us. Change its data type to binominal (note: binominal, not binomial). For the rest of the fields, the default attribute roles and data types that the import wizard selects are fine (see figure below).


Figure-II: For the last field, 'y', change the attribute role to label and the data type to binominal instead of integer

Similarly, import tovalidate.csv. However, as we are interested in making predictions about the 'y' field in this file, do not import the 'y' field (deselect it, as below).


Figure-III: While importing the tovalidate.csv file, de-select the last (target) field, 'y'

Drag the just-imported data source, training.csv, from the RapidMiner repository into the middle (Process) window. Search for the 'Split Validation' operator (Evaluation -> Validation -> Split Validation) in the top-left 'Operators' box and drag it into the Process window.


Figure-IV: The Split Validation operator. Drag it into the Process window.

Connect the two operators as shown in Figure-V (the out port is connected to the tra port).


Figure-V: The operators are Split Validation and Apply Model; the data sources are training.csv and tovalidate.csv

Next, double-click the lower-right icon (two overlapping squares) within the Split Validation operator to open another (Validation) window, where the model-building and validation actions will be performed (see below).


Figure-VI: The Split Validation window. Drag the Logistic Regression, Apply Model and Performance operators into the two panels of the window as shown.

Search for the Logistic Regression operator and drag it into the left panel. Drag the Apply Model (Modeling -> Model Application -> Apply Model) and Performance (Evaluation -> Performance) operators into the right panel. Connect the operators as shown. Shift back to the Process window (by clicking Process just above the left panel). From the repository (Figure-I), drag the imported tovalidate.csv data source, and from the Operators window drag another Apply Model operator, into the central Process window. Complete all the port connections as shown in Figure-V.

What we have done is this: in the Process window, the Split Validation operator splits the incoming data stream (from training.csv) in a 70:30 ratio. It uses the 70% stream to build a logistic regression model (Figure-VI, left panel) and applies this model to the remaining 30% to evaluate its performance (Figure-VI, right panel). All of this model building and testing happens within the Split Validation operator itself. The operator has two outputs: the constructed logit model (mod output port) and the performance evaluation (ave output port). The mod output is fed into the Apply Model operator's mod input port; Apply Model's other input port receives data from the tovalidate data source. Apply Model's two output ports, one with the labelled (classified) data (lab port) and the other with the model itself (mod port), are connected to inputs of the Results window, as is the ave port of the Split Validation operator.

Model Results

Run the setup now. Model building and testing is quite time-consuming: on my i7, 8GB laptop it took 40 minutes, so be patient. There is a whole lot of output in the Results window. In brief, the logistic model is as follows:


Figure-VII: The logistic model. The model is a linear relationship between the log of the odds and the weighted independent variables.
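
For reference, the general form behind this caption is the standard logistic-regression identity (here p can be read as the probability of the y = 2 class, and the betas are the attribute weights RapidMiner reports in the figure):

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$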

And the confusion matrix is as below. The confusion matrix displays how many of the predicted values matched the actual values in the validation tests performed (by the Split Validation operator). For example, of the records predicted with y classified as 1, 10724 predictions were correct and 1205 were incorrect. The confusion matrix shows that for the prediction of y = 1 the model's precision is 90%, while for the prediction of y = 2 it is 66%.
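
As a quick sanity check, the 90% figure follows directly from the counts quoted above:

```r
# Class precision for y = 1, from the confusion-matrix counts in the text.
correct   <- 10724
incorrect <- 1205
correct / (correct + incorrect)   # ~0.899, i.e. the ~90% reported
```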


Figure-VIII: Confusion matrix

Which measure of precision should we rely on? And why is there so much difference? The answer to the first question depends on which prediction is more important for you, y = 2 or y = 1; prediction of y = 1 is more accurate than prediction of y = 2. The answer to the second question can be seen in the number of validated training cases with y = 2: the number of such cases (1490) is much smaller than the number of training cases with the y = 1 classification (10873). The model therefore over-trains (over-fits) itself on the y = 1 cases. Over-fitting implies that the model tries to fit even the noise that appears in the y = 1 records. On the flip side, model building does not get enough training records for the y = 2 class. Possibly, by fine-tuning the 'C' parameter of the Logistic Regression operator (for example, reducing it from 1 to 0.5), we can tell the model to over-fit less and generalize more.

The ROC curve is given below. The Receiver Operating Characteristic (ROC) curve is a plot of the true-positive rate against the false-positive rate for the different possible cut-off points of a validation test. An area under the ROC curve (AUC) of 0.5 is worthless, meaning the model did not learn anything, while an area of 1 means a perfect model. Our AUC value of 0.856 (nearer to 1) is indicative of a good fit.
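
As an aside, the same ROC/AUC check can be done in R with the pROC package, reusing the hypothetical fit, idx and probs objects from the earlier sketch:

```r
# ROC curve and AUC on the held-out 30% (pROC is assumed to be installed).
library(pROC)
roc_obj <- roc(response = train$y[-idx], predictor = probs)
auc(roc_obj)    # RapidMiner reported 0.856 for its model
plot(roc_obj)   # true-positive rate against false-positive rate
```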


Figure-IX: AUC (area under the ROC curve) is 0.856. This is indicative of a good model.

Recall that in the main Process window we are also passing as-yet-unclassified data (tovalidate.csv minus the 'y' data-field) through the logistic model. So, what predictions are there for this data? RapidMiner reports the predictions as in the figure below.


Figure-X: The green-coloured column is the predicted y. The prediction is based on the calculations in the columns Confidence(1), i.e. confidence in y being 1, and Confidence(2), i.e. confidence in y being 2.
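
A hedged R analogue of this last Apply Model step, again reusing the hypothetical fit from the earlier sketch: for a binomial glm whose label has the levels '1' and '2', predict(type = "response") returns P(y = 2), which mirrors RapidMiner's Confidence(2) column.

```r
# Score the unclassified records and rebuild the Confidence/prediction columns.
newdata <- read.csv("tovalidate.csv")
p2      <- predict(fit, newdata, type = "response")           # Confidence(2)
result  <- data.frame(prediction   = ifelse(p2 > 0.5, 2, 1),  # predicted y
                      Confidence.1 = 1 - p2,                  # Confidence(1)
                      Confidence.2 = p2)
head(result)
```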

In Part-III of this series, we will build the model using R.
