Posts Tagged ‘mahout command line’

Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part IV

March 14, 2014

This is the last part of the series on logistic regression, where we use different statistical tools for modelling. The data is described in Part-I, where we also performed data-preparation steps such as separating training and hold-out samples and converting categorical data to integers (levels). In Part-II we built the model using RapidMiner, and in Part-III we used R to create a logistic regression model. In this part we use mahout from the command line to build our model.

Given the training data and the validation sample, carrying out the analysis from the mahout command line is quite straightforward. For best results, a few model parameters need to be adjusted by trial and error. The mahout command to train the machine and build the model is ‘trainlogistic’. The arguments to this command and their explanations are given below.

Command: mahout trainlogistic
  Flag       Any argument to flag      Description
--help                                  print this list
--quiet                                 be extra quiet
--input       input                     where to get training data
--output      output                    where to write the model
--target      target                    the name of the target variable
--categories  number                    the number of target categories to be considered
--predictors p [p ...]                  a list of predictor variables
--types      t [t ...]                  a list of predictor variables types (numeric, word or text)
--passes     passes                     the number of times to pass over the input data
--lambda     lambda                    the amount of coefficient decay to use
--rate       learningRate              the learning rate
--noBias                                do not include a bias term
--features   numFeatures                the number of internal hashed features to use

In model fitting, we try to reduce the error between the observed value and the predicted value. Mahout uses the stochastic gradient descent (SGD) method to reduce this error. In SGD, we keep adjusting the model weights so long as the error is decreasing. That is, we try to walk down the gradient of the error curve (imagine it to be U-shaped) until a (global) minimum is reached; any further adjustment of the weights will increase the error. In each such step down the curve, the size of the step depends upon the learning rate. The learning rate determines how fast we learn from new values in contrast to old values: a large learning rate places greater emphasis on newer values than on older ones. To economize on computational cost, at every iteration stochastic gradient descent samples a subset of the data to estimate the error. This is very effective for large-scale machine learning problems.

It may be pointed out that the error curve may have both a global minimum and local minima; an appropriate value of the learning rate is therefore important. As one descends the error curve, its slope may flatten. To actually reach the minimum, and stay there, both the learning rate and the weights must be gradually lowered; the value of lambda is used for this purpose. The number of passes indicates how many times the input is passed over for model building.
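
The role of the learning rate can be seen in a toy sketch of (plain, one-point) gradient descent. This is only an illustration of the idea, not Mahout's actual SGD implementation:

```shell
# Toy gradient descent on a single weight w for the model y = w*x,
# fitted to one data point (x=2, y=4); the true weight is 2.
# rate is the learning rate; each pass steps w down the error gradient.
awk 'BEGIN {
  w = 0 ; rate = 0.1 ; x = 2 ; y = 4 ;
  for (pass = 1; pass <= 25; ++pass) {
    err = w*x - y ;          # prediction error on this point
    w   = w - rate*err*x ;   # gradient step (derivative of squared error)
  }
  printf "learned w = %.3f\n", w ;
}'
```

With rate = 0.1 the weight settles at 2.000 within a few passes; a much larger rate would overshoot back and forth, while a much smaller one would need many more passes, which is why the rate (and lambda) deserve trial-and-error tuning.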

Once the model is ready, it can be validated and predictions made using the hold-out sample. The mahout command for this purpose is ‘runlogistic’. You can get help on it by passing the ‘--help’ argument to this command as follows:

[ashokharnal@master mahout]$ mahout runlogistic --help
 Usage:
  [--help --quiet --auc --scores --confusion --input <input> --model <model>]
   --help                         print this list                                                         
   --quiet                        be extra quiet                                                          
   --auc                          print AUC                                                               
   --scores                       print scores                                                            
   --confusion                    print confusion matrix                                                  
   --input input                  where to get training data                                              
   --model model                  where to get a model        

Given now the ‘training.csv’ and ‘tovalidate.csv’ files, the complete shell script is written below. The script is well commented to make it clearer.


export HADOOP_CLASSPATH="/usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*:$HADOOP_CLASSPATH"
# Files, training and hold-out (placeholder names; adjust to your setup)
datafolder=~/mahout
datafile=training.csv
validatefile=tovalidate.csv
# Our model will be created in a file by this name
modelfolder=model
modelfile=model

cd $datafolder
# Delete any existing model directory
rm -r -f $datafolder/$modelfolder
# And create one again
mkdir $datafolder/$modelfolder

# Copy tovalidate.csv from data folder to model folder
cp $datafolder/$validatefile  $datafolder/$modelfolder/$validatefile

########Begin Training#################
echo " " ; echo " "
echo "Training phase. Building model: logit"
echo "-------------------------------"
echo " "; echo " "

mahout trainlogistic \
  --input $datafolder/$datafile \
  --output $datafolder/$modelfolder/$modelfile \
  --target y \
  --categories 2 \
  --predictors age    job   marital education default     balance  housing  loan  contact day     month  duration  campaign pdays previous  poutcome \
  --types     numeric word  word    word      word        numeric  word     word  word    numeric  word  numeric   numeric  numeric numeric word \
  --features 49  --passes 1000 --lambda 0.0001 \
  --rate 0.5

###############Model validation#################
echo " " ; echo " "
echo "---------------------------------------------------"  
echo -n "Validation phase. Begin validation? (Press ENTER key) " ; read x
echo "---------------------------------------------------"  

mahout runlogistic \
  --input $datafolder/$modelfolder/$validatefile \
  --model $datafolder/$modelfolder/$modelfile \
   --scores --auc --confusion  > $datafolder/output.txt

echo -n "Score and other output is in file output.txt"   ; read x

The ‘categories’ are obviously 2 (target y: yes/no). Why have we chosen ‘features’ as 49? It is because the total number of (categorical + numerical) levels per record is 49; for example, attribute ‘job’ alone has 12 levels. After selecting this value, one can play around with it (say, increase it to 100 or reduce it to 30) to see which value works best. For an easy explanation of ‘features’ and ‘feature hashing’, or what is known as the ‘hash trick’, please refer to a) Wikipedia, b) StackExchange Q/A and c) this article, in the order listed. These are the simplest explanations I could find.
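
The idea behind hashed features can be illustrated roughly in shell, using cksum as a stand-in hash function (Mahout uses its own hash internally; the feature strings below are just examples): each feature string is hashed into one of the 49 buckets, and the model keeps one weight per bucket rather than one per feature name.

```shell
# Hash a few feature strings into 49 buckets. cksum is only a convenient
# stand-in hash for illustration, not what Mahout actually uses.
for f in "job=retired" "marital=single" "age" "balance" ; do
  hash=$(printf '%s' "$f" | cksum | cut -d' ' -f1)
  echo "$f -> bucket $(( hash % 49 ))"
done
```

When --features is much smaller than the number of distinct levels, several features may collide in the same bucket; that is the trade-off the hash trick makes for a fixed-size model.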

In our case, the minimum number of features across the categorical variables and numeric variables (counting 1 per numeric variable) is 49. While training, we differentiated between categorical and numeric variable types through the type names ‘word’ and ‘numeric’. The model is as follows.

y ~ -0.796*Intercept Term + -17.705*age + 0.006*balance + -0.732*campaign 
+ -0.396*contact=1 + -0.836*contact=2 + -0.305*contact=3 + -2.182*day 
+ -0.906*default=0 + -0.002*default=1 + 0.738*duration + -0.500*education=1 
+ -2.623*education=2 + -1.132*education=3 + -2.977*education=4 + -2.258*housing=0 
+ -0.784*housing=1 + 0.013*job=1 + -0.332*job=10 + -0.486*job=11 + -17.704*job=12 
+ -0.795*job=2 + -0.820*job=3 + -0.739*job=4 + -0.090*job=5 + -0.805*job=6 + -0.396*job=7 
+ -0.629*job=8 + -0.466*job=9 + -1.121*loan=0 + -0.149*loan=1 + -0.543*marital=1 
+ -1.024*marital=2 + -0.587*marital=3 + -0.120*month=1 + -0.279*month=10 + -0.442*month=11 
+ -0.276*month=12 + -0.373*month=2 + -0.326*month=3 + -0.502*month=4 + -1.191*month=5 
+ -0.001*month=6 + -0.001*month=7 + -0.394*month=8 + -0.342*month=9 + 0.394*pdays 
+ -0.395*poutcome=1 + -1.190*poutcome=2 + 0.738*poutcome=3 + -18.216*poutcome=4 + 0.000*previous
      Intercept Term -0.79622
                 age -17.70507
             balance 0.00645
            campaign -0.73175
           contact=1 -0.39627
           contact=2 -0.83619
           contact=3 -0.30474
                 day -2.18161
           default=0 -0.90618
           default=1 -0.00165
            duration 0.73830
         education=1 -0.49953
         education=2 -2.62331
         education=3 -1.13234
         education=4 -2.97668
           housing=0 -2.25766
           housing=1 -0.78387
               job=1 0.01325
              job=10 -0.33196
              job=11 -0.48632
              job=12 -17.70413
               job=2 -0.79466
               job=3 -0.82046
               job=4 -0.73949
               job=5 -0.09014
               job=6 -0.80492
               job=7 -0.39627
               job=8 -0.62859
               job=9 -0.46582
              loan=0 -1.12142
              loan=1 -0.14857
           marital=1 -0.54272
           marital=2 -1.02386
           marital=3 -0.58652
             month=1 -0.12023
            month=10 -0.27851
            month=11 -0.44170
            month=12 -0.27578
             month=2 -0.37268
             month=3 -0.32635
             month=4 -0.50196
             month=5 -1.19071
             month=6 -0.00078
             month=7 -0.00079
             month=8 -0.39449
             month=9 -0.34164
               pdays 0.39449
          poutcome=1 -0.39548
          poutcome=2 -1.18956
          poutcome=3 0.73830
          poutcome=4 -18.21576
            previous 0.00000

The model becomes quite complex, practically littered with dummy variables; the number of independent dummy variables is quite large. We could, for example, have treated the ‘month’ variable as numeric to simplify things. The validation output is as below:

AUC = 0.84
confusion: [[4144.0, 199.0], [71.0, 86.0]]
entropy: [[NaN, NaN], [-45.2, -9.4]]

The area under the ROC (Receiver Operating Characteristic) curve, known as AUC, illustrates the discriminating power of the model between the true positive rate (i.e. the number of 1s correctly classified) and the false positive rate (the number of 0s wrongly classified). For a simple explanation of AUC please refer to this entry in my blog: Simple explanation of AUC. While an AUC of 0.84 may be good enough (max: 1; the higher the better), the results in the confusion matrix are not that encouraging. As per the confusion matrix, out of a total of 4343 predictions of ‘no’, 4144 or 95.4% are correct. But out of 157 predictions of ‘yes’, only about 55% are correct. You may experiment with changing the learning rate and lambda, but in this particular case this is the best we could get.
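
The percentages can be reproduced directly from the confusion matrix [[4144, 199], [71, 86]] quoted above:

```shell
# Per-class prediction accuracy from the confusion matrix:
# row 1 = predictions of 'no'  (4144 right, 199 wrong),
# row 2 = predictions of 'yes' (86 right, 71 wrong)
awk 'BEGIN {
  printf "no  predictions correct: %.1f%%\n", 100 * 4144 / (4144 + 199) ;
  printf "yes predictions correct: %.1f%%\n", 100 * 86   / (86 + 71) ;
}'
# prints: no  predictions correct: 95.4%
#         yes predictions correct: 54.8%
```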

Logistic Regression using RapidMiner Studio 6, R and Mahout command line–Part I

March 5, 2014

Logistic Regression is a multivariate statistical technique often used in predictive analysis where the dependent variable to be predicted is dichotomous and depends upon a number of independent variables. The independent variables may be continuous or categorical. Logistic regression therefore differs from normal (linear) regression analysis: in linear regression the dependent variable must be continuous in nature, whereas in logistic regression it must be categorical. In this respect logistic regression and discriminant analysis deal with the same kind of problems and have the same objective; however, the assumptions in discriminant analysis are more stringent than in logistic regression.
Logistic Regression is very popular in epidemiological studies and in marketing. It answers such questions as: given certain body-metrics, how likely is it that a patient will have, or be susceptible to, a certain disease? It is also widely used in computing TRISS (Trauma and Injury Severity Score) to predict mortality among severely injured patients. In marketing, it can answer such questions as what steps are more likely to retain customers, how customers’ propensity to purchase could be enhanced, or which customers are likely to default on home mortgage loans. In this four-part series, we will use three open-source tools, RapidMiner Studio, R and Mahout (command line), to perform logistic regression on a set of marketing data.
Problem and the data: We will make use of the Bank Marketing Data Set available at the UCI Machine Learning Repository. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls; often, more than one contact to the same client was required in order to assess if the product (a bank term deposit) would be subscribed or not. The data has been analysed by S. Moro, R. Laureano and P. Cortez using three techniques: Naïve Bayes, Decision Trees and Support Vector Machines (SVM). A Logistic Regression model, however, was not constructed. We will do that now. But first, the data. The data description is as below:


Data heading Description
age numeric
job type of job (categorical: admin., unknown, unemployed, management, housemaid, entrepreneur, student, blue-collar, self-employed, retired, technician, services)
marital marital status : married, divorced, single
education unknown, secondary, primary, tertiary
default yes, no
balance numeric
housing housing: yes,no
loan has personal loan: yes,no
contact contact communication : unknown, telephone, cellular
day day: last contact day of the month (numeric)
month last contact month of year : jan, feb, mar, …, nov, dec
duration last contact duration, in seconds (numeric)
campaign number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays number of days that passed by after the client was last contacted from a previous campaign (numeric)
previous number of contacts performed before this campaign and for this client (numeric)
poutcome outcome of the previous marketing campaign : unknown, other, failure, success

A small extract from the data-set is shown below. It has 45211 records. The last field ‘y’ records the success or failure of efforts to get the client to take a term deposit. It is a binary field, and it is this field we are interested in predicting for future clients.


The categorical data was transformed using a series of ‘sed’ statements (the full shell script is given next). A few lines of the transformed data are shown below. As the words ‘unknown’, ‘yes’ and ‘no’ appear as values in many data fields, for the sake of easy transformation the codes for these values were kept uniform throughout.
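
To show the effect of one such substitution, here is a single made-up line in the data-set's semicolon-separated format (the values are illustrative, not taken from the data) passed through the job-type sed command used in the script that follows:

```shell
# The quoted job type "management" is replaced by its code 4;
# the other fields pass through unchanged
echo '58;"management";"married";"tertiary";"no";2143' \
  | sed 's/\"management\"/4/g'
# prints: 58;4;"married";"tertiary";"no";2143
```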


The data transformation bash shell script is as below. It is liberally commented for easy reading.

# Your home folder
cd ~
# Your data folder & data file (placeholder names; adjust to your setup)
datadir=~/mahout
trainfile=bank-full.csv
cd $datadir
# Delete previous temp file
rm -f temp.txt
# Transforming first the job type (for example unknown is replaced by 2 and unemployed by 3)
sed 's/\"admin\.\"/1/g ; s/\"unknown\"/2/g ; s/\"unemployed\"/3/g ; s/\"management\"/4/g ; s/\"housemaid\"/5/g ; s/\"entrepreneur\"/6/g ; s/\"student\"/7/g ;s/\"blue-collar\"/8/g ; s/\"self-employed\"/9/g ;s/\"retired\"/10/g ; s/\"technician\"/11/g ; s/\"services\"/12/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming next marital status
sed 's/\"married\"/1/g ; s/\"divorced\"/2/g ; s/\"single\"/3/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming now education category
sed 's/\"unknown\"/2/g ; s/\"secondary\"/3/g ; s/\"primary\"/1/g ; s/\"tertiary\"/4/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming credit-in-default, has-housing-loan, has-personal-loan
sed 's/\"yes\"/1/g ; s/\"no\"/0/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming communication type
sed 's/\"unknown\"/2/g ; s/\"telephone\"/3/g ; s/\"cellular\"/1/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
# Transforming month
sed 's/\"jan\"/1/g ; s/\"feb\"/2/g ;s/\"mar\"/3/g ; s/\"apr\"/4/g ;s/\"may\"/5/g ;s/\"jun\"/6/g ; s/\"jul\"/7/g ; s/\"aug\"/8/g ; s/\"sep\"/9/g ; s/\"oct\"/10/g ; s/\"nov\"/11/g ;s/\"dec\"/12/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
#Transforming campaign success
sed 's/\"unknown\"/2/g ; s/\"other\"/3/g ; s/\"failure\"/4/g ;  s/\"success\"/1/g'  $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile

# Lastly, replace the semicolon separator with a comma separator
sed 's/;/,/g' $datadir/$trainfile > temp.txt
mv temp.txt $datadir/$trainfile
echo -n "Press a key to finish"; read x

The bank data set contains two csv files: ‘bank-full.csv’ containing 45211 records and another file ‘bank.csv’ containing around 4500 records. Records in ‘bank.csv’ have been selected on a random basis from bank-full.csv. The purpose of the records in bank.csv is to validate the logit model created from the full set. However, as these records also appear in bank-full.csv, we have written a script that randomly selects a pre-specified number of records from bank-full.csv (without replacement) and creates two files: training.csv and tovalidate.csv. The two contain different records without any duplication. The file training.csv is used to build the model and tovalidate.csv to test it. The shell script is given below:

# Generates cross-validation file to test logistic regression model
# The script randomly picks up n-lines (n to be specified) from the
# given file. It creates two files: training.csv and tovalidate.csv
# The original file remains undisturbed.

# Your home folder
cd ~
# Your data folder & data files (placeholder names; adjust to your setup)
datadir=~/mahout
originalfile=bank-full.csv
samplefile=tovalidate.csv

cd $datadir
# Delete earlier sample file & recreate it
rm -f $datadir/$samplefile
touch $datadir/$samplefile
# Delete earlier training file and recreate it
rm -f $datadir/training.csv
cp $datadir/$originalfile $datadir/training.csv
# Delete temp file, if it exists
rm -f temp.txt

# Number of lines in given file
nooflines=`sed -n '$=' $datadir/$originalfile`

echo "No of lines in $datadir/training.csv  are: $nooflines"
echo -n "Specify your sample size (recommended 10% of orig file)? " ; read samplesize

# If nothing specified, default is 50 records
if [ -z "$samplesize" ] ; then
    echo "Default value of sample size = 50"
    samplesize=50
fi

# Bash loop to generate random numbers
echo "Will generate $samplesize random line numbers starting from line 2"
echo "Original file size is $nooflines lines"
echo "Wait....."
# Note: bash RANDOM is limited to 0..32767, so for a 45211-line file
# lines far beyond 32769 can never be picked; adequate for a demo
for (( i = 1 ; i <= samplesize ; ++i )); do
     arr[i]=$(( ( RANDOM % nooflines )  + 2 ));
     lineno="${arr[i]}"  # Append this line to the sample file
     sed -n "${lineno}p" $datadir/training.csv >> $datadir/$samplefile
     # Delete the same line from training.csv
     sed "${lineno}d" $datadir/training.csv > temp.txt
     mv temp.txt $datadir/training.csv
done
trlines=`sed -n '$=' $datadir/training.csv`
samplines=`sed -n '$=' $datadir/$samplefile`

# Delete temp file
rm -f temp.txt

echo "---------------------------------"
echo "Lines in sample file $samplefile: $samplines"
echo "Lines in training file training.csv : $trlines" ;
echo "Data folder: $datadir"
echo "---------------------------------"
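
If your coreutils provides shuf, the same no-duplication split can be sketched far more compactly: shuffle the file once, then cut it into a 10% hold-out and a 90% training part. A 20-line dummy file stands in for the real data here, and no header line is assumed:

```shell
# Compact alternative to the loop-based split above, using shuf
seq 1 20 > original.csv
n=$(( $(wc -l < original.csv) / 10 ))        # hold-out size: 10% of lines
shuf original.csv > shuffled.txt
head -n "$n" shuffled.txt > tovalidate.csv
tail -n +"$(( n + 1 ))" shuffled.txt > training.csv
wc -l tovalidate.csv training.csv
```

Because every line of the shuffled file goes to exactly one of the two outputs, no record can appear in both.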

There is no missing data in the full data set. In Part-II we will build logistic model using RapidMiner Studio.

Discovering frequent patterns using mahout command line

February 18, 2014

We show here how mahout can be used from command line to mine frequently occurring patterns from data-sets. The data set used is marketbasket.csv file available from the website of Dr. Tariq Mahmood at National University of Computer and Emerging Sciences, Karachi, Pakistan. The file, marketbasket.csv, stores transaction information somewhat as below:

TransID, Grape, Mango, Orange.... 
C2, true, true, false      => Grape and Mango were purchased. Transaction id C2
C3, false, true, true      => Mango, Orange but not Grape were purchased
C4, true, false, false     => Only Grape was purchased. Trans id C4

For purposes of frequent pattern mining in mahout, the required format is either single-space-separated item-ids or comma-separated item-ids, as below:

2    3    7                => Itemids 2, 3 and 7 were purchased
5    8    3                => Itemids 5, 8 and 3 were purchased
2    3    34    67         => Keep (only) one space between item-ids

OR, comma separated item-ids, as

2,3,7                  => Itemids 2, 3 and 7 were purchased
5,8,3                  => Itemids 5, 8 and 3 were purchased

Note that we need neither the transaction IDs (first column of marketbasket.csv) nor the first line containing item names. Before we use mahout to discover patterns in marketbasket.csv, we need to convert it to either of the above two formats. We will use an awk script for the conversion; awk is very convenient whenever data in a text file is arranged in columns. The complete script is as below.
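
The core of the conversion can be seen on a single made-up transaction line: awk splits the line on commas and reports the column number of every 'true' field.

```shell
# Print the column number of each 'true' field in one sample line;
# index() returns 0 when "true" does not occur in the field
echo 'C2, true, false, true' \
  | awk -F, '{ for (x = 1; x <= NF; ++x)
                 if (index($x, "true") != 0) printf "col %d\n", x }'
# prints: col 2
#         col 4
```

The full script below does the same, but additionally writes the column number into the field, blanks out 'false' fields and the basket-id, and tidies the spacing.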

###Shell script to convert marketbasket.csv to .dat and .csv format###

# The awk script converts marketbasket.csv
# to marketbasket.dat and market.csv
# File marketbasket.csv remains undisturbed
cd ~
cd $datadir

awk -F, ' BEGIN { line_counter = 0 ; }
    # count lines; we ignore the first item-names line
    { ++line_counter ;
      # Replace every 'true' field with its column number.
      # NF is an awk variable that stores the col no of the last field
      for (x = 1; x <= NF ; ++x)
      {
        # Check if 'true' is embedded in the field and if yes,
        # replace by column number. index() is an awk function
        i = index($x,"true");
        if ( i != 0 ) $x=x ;
      } # End of for loop
    } # End of second block
    # Replace every 'false' with blank. gsub() is an awk function.
    { gsub(/false/," ") }
    # Replace first field starting with C (and followed by some numbers) with blank
    # Field starting with C is Basket-id. We do not need it for pattern analysis
    { gsub(/^C[0-9]*/,"")}
    # Replace multiple spaces with single space
    {gsub(/  */," ")}
    # Trim beginning line space
    {gsub(/^ /,"")}
    # Do not print out the first line with item names
    { if ( line_counter != 1 ) print $0 ; }
  ' marketbasket.csv | sed 's/\s*$//g' > marketbasket.dat
    # Trim ending line space before creating .dat file
    # Trim ending line space before creating .dat file

# Replace space with comma and finish
sed 's/\s/,/g' marketbasket.dat > market.csv

The above script generates two files: 1) marketbasket.dat and 2) market.csv. Both are in a form appropriate for fpg (frequent pattern growth) analysis with mahout. We now use the following shell script to carry out the fpg analysis using mahout. The code is liberally commented.

#####Shell script to run mahout fpg analysis#####


cd ~
# Folder where marketbasket.dat and market.csv exist (placeholder; adjust)
datadir=~/mahout
# Folder in hadoop where we will store marketbasket.dat and market.csv
hadoopfolder=/user/ashokharnal
cd $datadir

# export hadoop library classpath
export HADOOP_CLASSPATH="/usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*:$HADOOP_CLASSPATH"

# Copy data files to hadoop folder
hdfs dfs -put $datadir/marketbasket.dat $hadoopfolder/marketbasket.dat

# Run mahout fpg command on .dat file. Output goes to 'patterns' folder. Note
# the -regex flag. It expects that item nos are separated by single space.
# We have specified a minimum support of 2 patterns (-s 2) in the dataset for
# it to be listed in the output. Maximum patterns are 50 (-k 50).

mahout fpg \
-i $hadoopfolder/marketbasket.dat \
-o patterns \
-k 50 \
-method mapreduce \
-regex '[\ ]' -s 2

# Next, we output 50 key-value (pattern) pairs from two part-r-0000? files
# from patterns folder on hadoop.

mahout seqdumper \
-i patterns/frequentpatterns/part-r-00000 \
-o $datadir/result_1.txt \
-n 50

mahout seqdumper \
-i patterns/frequentpatterns/part-r-00001 \
-o $datadir/result_2.txt \
-n 50

The resulting output in file result_1.txt and in result_2.txt is something as below:

Max Items to dump: 50
Key: 10: Value: ([10],40), ([126, 10],28), ([143, 10],21), ([134, 10],21), ([18, 10],20),
([143, 126, 10],19), ([134, 126, 10],19), ([287, 10],19), ([234, 10],19), ([134, 18, 10],18),
([126, 234, 10],17), ([126, 18, 10],17), ([134, 126, 18, 10],16), ([134, 143, 10],16),
([126, 287, 10],16), ([143, 287, 10],15), ([143, 234, 10],15), ([143, 18, 10],15),
([134, 287, 10],15), ([143, 126, 287, 10],14), ([143, 126, 234, 10],14), ([134, 18, 287, 10],14),
([134, 143, 18, 10],14), ([134, 143, 126, 10],14), ([134, 126, 287, 10],14), 
([134, 126, 18, 287, 10],13), ([143, 126, 18, 10],13), ([134, 143, 287, 10],13),
([134, 234, 10],13), ([134, 143, 18, 287, 10],12), 

The entry ([10],40) means item 10 occurs 40 times, and the entry ([134, 126, 10],19) means the combination of items 134, 126 and 10 occurs 19 times. However, it is difficult to interpret item numbers. We convert the item numbers back to item names by simple repeated use of sed scripts, as below:
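
Before mapping numbers to names, the individual (pattern, count) pairs can be pulled out of a line of this output with grep; the sample line here is abbreviated from the output shown above:

```shell
# Extract each ([items],count) pair onto its own line
echo 'Key: 10: Value: ([10],40), ([126, 10],28), ([143, 10],21)' \
  | grep -oE '\(\[[^]]*\],[0-9]+\)'
# prints: ([10],40)
#         ([126, 10],28)
#         ([143, 10],21)
```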

######Shell script to replace item-id by item name#######

# Find item no (same as col no) in file result_1.txt and replace by item-names.
# item-names are read from the first line of marketbasket.csv.
# File, final_result_1.txt, is the output file with item-names

cd ~
cd $datafolder
# File to be converted (output of seqdumper above)
rptfile=result_1.txt

#Read first line of marketbasket.csv file
line=$(head -n 1 marketbasket.csv)

# Trim all spaces (only) around commas (but not between two words)
echo $line | sed -e 's/, */,/g' > temp.txt
# Read temp.txt again & replace remaining space (between two words) by dash (-)
# ice cream becomes ice-cream
line=$(head -n 1 temp.txt)
echo $line | sed -e 's/ /-/g' > temp.txt

# Next, all item names are collected into an array
line=$(head -n 1 temp.txt)
IFS=', ' read -a array <<< "$line"

# In a loop, read all array values (ie item-names). One-by-one, each
# column no in file result_1.txt is replaced with the corresponding item-name
cp $rptfile tmp.txt
i=1   # first header field corresponds to column 1
for element in "${array[@]}"
do
  # Replace pattern as: 23, 56, 78,
  sed "s/ $i\,/$element\,/g" tmp.txt > temp1.txt
  cp temp1.txt tmp.txt

  # Replace pattern as: 23], 56], 78],
  sed "s/ $i\]\,/ $element\]\,/g" tmp.txt > temp1.txt
  cp temp1.txt tmp.txt

  # Replace pattern as: [23, [56, [78, with item name
  sed "s/\[$i\,/\[$element\,/g" tmp.txt > temp1.txt
  cp temp1.txt tmp.txt

  # Replace pattern as: [23], [56], [78] with item name
  sed "s/\[$i\]/\[$element\]/g" tmp.txt > temp1.txt
  cp temp1.txt tmp.txt

  i=$(($i + 1))
done
# Moving results to file: final_result_1.txt
mv tmp.txt final_result_1.txt
# Delete temporary files
rm -f temp1.txt
rm -f temp.txt
# Similarly convert file result_2.txt
# Similarly convert file result_2.txt

The resulting output in file, final_result_1.txt, is as follows:

Key: 10: Value: ([Dishwasher-Detergent],40), ([2pct.-Milk, Dishwasher-Detergent],28),
([White-Bread, Dishwasher-Detergent],21), ([Eggs, Dishwasher-Detergent],21),
([Potato-Chips, Dishwasher-Detergent],20), ([White-Bread,2pct.-Milk, Dishwasher-Detergent],19), 
([Eggs,2pct.-Milk, Dishwasher-Detergent],19), ([Aspirin, Dishwasher-Detergent],19),
([Wheat-Bread, Dishwasher-Detergent],19), ([Eggs,Potato-Chips, Dishwasher-Detergent],18),
([2pct.-Milk,Wheat-Bread, Dishwasher-Detergent],17), ([2pct.-Milk,Potato-Chips, Dishwasher-
Detergent],17), ([Eggs,2pct.-Milk,Potato-Chips, Dishwasher-Detergent],16), ([Eggs,White-Bread, 
Dishwasher-Detergent],16), ([2pct.-Milk,Aspirin, Dishwasher-Detergent],16),

Incidentally, if you want to use ‘market.csv’ (the second output file from the awk script) for pattern analysis, the mahout command would be as below. The default -regex pattern is [[ , ]*[,| ][ , ]*] and expects comma-separated IDs, as in market.csv.

mahout fpg \
-i $hadoopfolder/market.csv \
-o patternscsv \
-k 50 -method mapreduce \
-s 2 

In the mahout fpg command above, the method (-method) used is ‘mapreduce’. Instead, the ‘sequential’ method could have been used. In this method the ‘.dat’ file is read from the Linux file system (not the hadoop file system) and, therefore, there can be no parallel analysis of the file data. However, results are still stored in the hadoop file system and can be extracted using seqdumper. You may find it useful to refer to this Apache site.

A number of data sets for frequent pattern analysis are available at the Frequent Itemset Mining Dataset Repository. The Extended Bakery dataset is available here. An excellent analysis is available here.

Text clustering using Mahout command line–Step-by-Step

February 9, 2014

Here is a step-by-step guide to Mahout command-line text clustering. I have the Cloudera hadoop ecosystem installed on a single machine (CentOS 6.5). Mahout was installed thereafter with the simple command:

# yum install mahout

The installed mahout is version 0.7. Check as:

$ rpm -ql mahout

For this experiment, copy a few Wikipedia articles to your favourite text editor and save them as text files in a folder. I copied articles related to Finance and Quantum Mechanics (folder repo/mytext/). Following is the shell script that analyses the text files, clusters them, and finally prints the two clusters along with a list of files. I use k-means clustering.

############Shell Script########################
cd ~
# What is my home folder (placeholder values below; adjust to your setup)
# This file stores analysis results
resultFile=~/cluster_result.txt
# Folder containing Wikipedia articles
textFolder=~/repo/mytext
# Folder on hadoop wherein other folders
#   will be created subsequently
hdfsFolder=/user/ashokharnal

# Step 1: Mahout will need to know the path to hadoop jar files.
#   In Cloudera installed on CentOS, the path is as below:

export HADOOP_CLASSPATH="/usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*:$HADOOP_CLASSPATH"

# Step 2:Copy now your Wiki text folder to hdfs folder
#  A folder /user/ashokharnal/mytext is created:

hdfs dfs -put  $textFolder/  $hdfsFolder/

# Step 3: Convert text files in the hadoop folder to sequence format:

mahout seqdirectory  \
-i  $hdfsFolder/mytext \
-o  $hdfsFolder/mytext-seq  \
-c  UTF-8 \
-chunk 5

# Step 4: Convert sequence format to sparse vector format
#    Output stored in mytext-vectors
#    Flag -nv also keeps file names

mahout seq2sparse \
-nv   -i  $hdfsFolder/mytext-seq/  \
-o   $hdfsFolder/mytext-vectors

# Step 5: Create two k-means cluster now:

mahout kmeans -i $hdfsFolder/mytext-vectors/tfidf-vectors/  \
-c $hdfsFolder/mytext-kmeansSeed  \
-o $hdfsFolder/mytext-clusters   \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure  \
--clustering  -cd  0.1  -x  10  -k  2  -ow

# Step 6: Print output to a text file on Linux file system:

mahout clusterdump  -i  $hdfsFolder/mytext-clusters/clusters-0   \
-o  $resultFile   \
-d  $hdfsFolder/mytext-vectors/dictionary.file-0 \
-b  100   \
-p  $hdfsFolder/mytext-clusters/clusteredPoints    \
-dt  sequencefile   -n  20

cat $resultFile

# Step 7: Print file names vs clusters using mahout seqdumper
#   In the result (tmp.txt), replace anything between [  ] with
#     spaces and sort the results in order of key value:

mahout seqdumper  \
-i $hdfsFolder/mytext-clusters/clusteredPoints/part-m-00000 > tmp.txt

# Use sed, stream editor, to parse the above output

sed -e "s/= \[.*\]/ /" tmp.txt  |  sort
rm -r -f tmp.txt
############End of shell script############

The output of sed command will be something like:

Input Path: /user/ashokharnal/mytext-clusters/clusteredPoints/part-m-00000
Key: 0: Value: 1.0: /Corporate Finance
Key: 0: Value: 1.0: /Finance
Key: 0: Value: 1.0: /Financial Capitalism
Key: 0: Value: 1.0: /Financial Markets
Key: 0: Value: 1.0: /Personal Finance
Key: 7: Value: 1.0: /Introduction to Quantum mechanics
Key: 7: Value: 1.0: /Quantum Circuit
Key: 7: Value: 1.0: /Quantum Computer
Key: 7: Value: 1.0: /Quantum mechanics

Two file clusters have been created, one labelled Key=0 and the other Key=7.

For the theory behind text clustering explained in a simple way, refer to this excellent three-part tutorial.