Discovering frequent patterns using the Mahout command line

We show here how Mahout can be used from the command line to mine frequently occurring patterns from datasets. The dataset used is the marketbasket.csv file, available from the website of Dr. Tariq Mahmood at the National University of Computer and Emerging Sciences, Karachi, Pakistan. The file stores transaction information along these lines:

TransID, Grape, Mango, Orange.... 
C2, true, true, false      => Grape and Mango were purchased. Transaction id C2
C3, false, true, true      => Mango and Orange, but not Grape, were purchased
C4, true, false, false     => Only Grape was purchased. Transaction id C4

For the purposes of frequent pattern mining in Mahout, the required format is either single-space-separated item-ids or comma-separated item-ids, as below:


2    3    7                => Itemids 2, 3 and 7 were purchased
5    8    3                => Itemids 5, 8 and 3 were purchased
2    3    34    67         => Keep (only) one space between item-ids

OR, comma-separated item-ids, as below:

2,3,7                  => Itemids 2, 3 and 7 were purchased
5,8,3                  => Itemids 5, 8 and 3 were purchased
2,3,34,67

Note that we need neither the transaction IDs (the first column of marketbasket.csv) nor the first line containing item names. Before we use Mahout to discover patterns in marketbasket.csv, we need to convert it to either of the above two formats. We will use an awk script for the conversion; awk is very convenient whenever data in a text file is arranged in columnar fashion. The complete script is as below:

###Shell script to convert marketbasket.csv to .dat and .csv format###

#!/bin/sh
# The awk script converts marketbasket.csv
# to marketbasket.dat and market.csv
# File marketbasket.csv remains undisturbed
cd ~
myhome=`pwd`
datadir="$myhome/Documents/marketbasket"
cd $datadir

awk -F, ' BEGIN { line_counter = 0 ; }
  {
    # count lines; we ignore the first item-names line
    ++line_counter
    {
    # Replace every 'true' field with its column number.
    # NF is an awk variable holding the number of fields,
    # i.e. the column number of the last field
    for (x = 1; x <= NF ; ++x)
      {
      # Check if 'true' is embedded in the field and if yes,
      # replace by column number. index() is an awk function
      i = index($x,"true");
      if ( i != 0 ) $x=x ;
      } # End of for loop 
    } # End of second block
    # Replace every 'false' with blank. gsub() is an awk function.
    { gsub(/false/," ") }
    # Replace first field starting with C (and followed by some numbers) with blank
    # Field starting with C is Basket-id. We do not need it for pattern analysis
    { gsub(/^C[0-9]*/,"")}
    # Replace multiple spaces with single space
    {gsub(/  */," ")}
    # Trim beginning line space
    {gsub(/^ /,"")}
    # Do not print out the first line with item names
    if ( line_counter != 1 ) print $0 ;
  }' marketbasket.csv | sed 's/\s*$//g' > marketbasket.dat
# The sed above trims trailing spaces from each line before the .dat file is written

# Replace space with comma and finish
sed 's/\s/,/g' marketbasket.dat > market.csv

The above script generates two files: 1) marketbasket.dat and 2) market.csv. Both are in a form appropriate for FP-Growth (fpg) analysis with Mahout. A quick spot-check of the conversion is shown just below, after which we use a shell script to carry out the fpg analysis. The code is liberally commented.
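
A minimal sanity check, assuming it is run from the data folder used in the script above:

# Inspect the first few converted transactions in each file
head -n 3 marketbasket.dat
head -n 3 market.csv
# Both files should contain the same number of transactions
wc -l marketbasket.dat market.csv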

#####Shell script to run mahout fpg analysis#####

#!/bin/sh

cd ~
myhome=`pwd`
# Folder where marketbasket.dat and market.csv exist
datadir="$myhome/Documents/marketbasket"
# Folder in hadoop where we will store marketbasket.dat and market.csv
hadoopfolder="/user/ashokharnal"
cd $datadir

# export hadoop library classpath
export HADOOP_CLASSPATH="/usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*:$HADOOP_CLASSPATH"

# Copy both data files to the hadoop folder (market.csv is used in the CSV run later)
hdfs dfs -put $datadir/marketbasket.dat marketbasket.dat
hdfs dfs -put $datadir/market.csv market.csv

# Run the mahout fpg command on the .dat file. Output goes to the 'patterns' folder
# on hadoop. Note the -regex flag: it expects item-ids separated by a single space.
# A pattern must have a minimum support of 2 (-s 2) in the dataset for it to be
# listed in the output. At most the top 50 patterns are retained (-k 50).

mahout fpg \
-i $hadoopfolder/marketbasket.dat \
-o patterns \
-k 50 \
-method mapreduce \
-regex '[\ ]' -s 2

# Next, we dump up to 50 key-value (pattern) pairs from each of the two
# part-r-0000? files in the 'patterns' folder on hadoop.

mahout seqdumper \
-i patterns/frequentpatterns/part-r-00000 \
-o $datadir/result_1.txt \
-n 50

mahout seqdumper \
-i patterns/frequentpatterns/part-r-00001 \
-o $datadir/result_2.txt \
-n 50
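
The number of part-r-0000? files depends on how many reducers the job ran with, so it is worth listing the output folder before dumping. A small check, assuming the 'patterns' output folder used above:

# List the part files produced by the fpg job
hdfs dfs -ls patterns/frequentpatterns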

The resulting output in files result_1.txt and result_2.txt looks something like this:

Max Items to dump: 50
Key: 10: Value: ([10],40), ([126, 10],28), ([143, 10],21), ([134, 10],21), ([18, 10],20),
([143, 126, 10],19), ([134, 126, 10],19), ([287, 10],19), ([234, 10],19), ([134, 18, 10],18),
([126, 234, 10],17), ([126, 18, 10],17), ([134, 126, 18, 10],16), ([134, 143, 10],16),
([126, 287, 10],16), ([143, 287, 10],15), ([143, 234, 10],15), ([143, 18, 10],15),
([134, 287, 10],15), ([143, 126, 287, 10],14), ([143, 126, 234, 10],14), ([134, 18, 287, 10],14),
([134, 143, 18, 10],14), ([134, 143, 126, 10],14), ([134, 126, 287, 10],14), 
([134, 126, 18, 287, 10],13), ([143, 126, 18, 10],13), ([134, 143, 287, 10],13),
([134, 234, 10],13), ([134, 143, 18, 287, 10],12), 

The entry ([10],40) means item 10 occurs 40 times, and the entry ([134, 126, 10],19) means the combination of items 134, 126 and 10 occurs 19 times. Bare item numbers, however, are difficult to interpret. A single item's support can be pulled out quickly, as shown next, and the script after that converts item numbers back to item names by simple repeated use of sed.
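
A quick grep sketch, assuming item-id 10 as in the listing above:

# Extract the singleton entry for item 10, e.g. ([10],40)
grep -o '(\[10\],[0-9]*)' result_1.txt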

######Shell script to replace item-id by item name#######


#!/bin/bash
# (bash, rather than sh, is needed for the array and <<< here-string used below)
#
#________________________________________________________
# Find item no (same as col no) in file result_1.txt and replace by item-names.
# item-names are read from the first line of marketbasket.csv.
# File, final_result_1.txt, is the output file with item-names
#________________________________________________________

cd ~
myhome=`pwd`
datafolder="$myhome/Documents/marketbasket"
cd $datafolder
rptfile="result_1.txt"

#Read first line of marketbasket.csv file
line=$(head -n 1 marketbasket.csv)

# Trim all spaces around commas (but not between two words)
echo $line | sed -e 's/ *, */,/g' > temp.txt
# Read temp.txt again & replace remaining space (between two words) by dash (-)
# ice cream becomes ice-cream
line=$(head -n 1 temp.txt)
echo $line | sed -e 's/ /-/g' > temp.txt

# Next, all item names are collected into an array
line=$(head -n 1 temp.txt)
IFS=', ' read -a array <<< "$line"

# In a loop, read all array values (ie item-names). One-by-one,
# column no in file result_1.txt is replaced with corresponding item-name
i=1
for element in "${array[@]}"
do
  # Replace pattern as: 23, 56, 78,
  sed "s/ $i\,/$element\,/g" $rptfile > temp1.txt
  cp temp1.txt tmp.txt
  rptfile="tmp.txt"

  # Replace pattern as: 23], 56], 78],
  sed "s/ $i\]\,/ $element\]\,/g" $rptfile > temp1.txt
  cp temp1.txt tmp.txt
  rptfile="tmp.txt"

  # Replace pattern as: [23, [56, [78, with item name
  sed "s/\[$i\,/\[$element\,/g" $rptfile > temp1.txt
  cp temp1.txt tmp.txt
  rptfile="tmp.txt"

  # Replace pattern as: [23], [56], [78] with item name
  sed "s/\[$i\]/\[$element\]/g" $rptfile > temp1.txt
  cp temp1.txt tmp.txt
  rptfile="tmp.txt"

  i=$(($i + 1))
done
# Moving results to file: final_result_1.txt
mv tmp.txt final_result_1.txt
# Delete temporary files
rm -f temp1.txt
rm -f temp.txt
##########
# Similarly convert file result_2.txt
##########
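
Rather than re-running the script by hand for result_2.txt, the replacement loop can be wrapped to cover both result files. A sketch, assuming it replaces the single-file loop above (the array of item-names is built the same way):

for rpt in result_1.txt result_2.txt
do
  cp "$rpt" tmp.txt
  i=1
  for element in "${array[@]}"
  do
    # The same four replacement patterns as above, chained into one sed call
    sed -e "s/ $i,/$element,/g" \
        -e "s/ $i\],/ $element\],/g" \
        -e "s/\[$i,/\[$element,/g" \
        -e "s/\[$i\]/\[$element\]/g" tmp.txt > temp1.txt
    mv temp1.txt tmp.txt
    i=$(($i + 1))
  done
  # Produces final_result_1.txt and final_result_2.txt
  mv tmp.txt "final_$rpt"
done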

The resulting output in file final_result_1.txt is as follows:


Key: 10: Value: ([Dishwasher-Detergent],40), ([2pct.-Milk, Dishwasher-Detergent],28),
([White-Bread, Dishwasher-Detergent],21), ([Eggs, Dishwasher-Detergent],21),
([Potato-Chips, Dishwasher-Detergent],20), ([White-Bread,2pct.-Milk, Dishwasher-Detergent],19),
([Eggs,2pct.-Milk, Dishwasher-Detergent],19), ([Aspirin, Dishwasher-Detergent],19),
([Wheat-Bread, Dishwasher-Detergent],19), ([Eggs,Potato-Chips, Dishwasher-Detergent],18),
([2pct.-Milk,Wheat-Bread, Dishwasher-Detergent],17), ([2pct.-Milk,Potato-Chips, Dishwasher-Detergent],17),
([Eggs,2pct.-Milk,Potato-Chips, Dishwasher-Detergent],16), ([Eggs,White-Bread, Dishwasher-Detergent],16),
([2pct.-Milk,Aspirin, Dishwasher-Detergent],16),

Incidentally, if you had wanted to use market.csv (the second output file from the awk script) for pattern analysis, the mahout command would have been as below. The default -regex pattern is [ ,\t]*[,|\t][ ,\t]* and expects comma-separated IDs, as in market.csv.

mahout fpg \
-i $hadoopfolder/market.csv \
-o patternscsv \
-k 50 -method mapreduce \
-s 2 
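
The output of this run can be dumped in the same way as before. A sketch, assuming the 'patternscsv' output folder above contains a part-r-00000 file:

mahout seqdumper \
-i patternscsv/frequentpatterns/part-r-00000 \
-o $datadir/result_csv.txt \
-n 50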

In the mahout fpg command, the method used (-method) is 'mapreduce'. The 'sequential' method could have been used instead: with it, the '.dat' file is read from the Linux file system (not the hadoop file system), so the file-data cannot be analysed in parallel. However, results are still stored in the hadoop file system and can be extracted using seqdumper. An illustrative sequential invocation is sketched below. You may find it useful to refer to the Apache Mahout site.
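
This is only a sketch: the local input path and the 'patternsseq' output folder name are assumptions, with marketbasket.dat read from the local data folder rather than hadoop:

mahout fpg \
-i $datadir/marketbasket.dat \
-o patternsseq \
-k 50 \
-method sequential \
-regex '[\ ]' -s 2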

A number of datasets for frequent pattern analysis are available at the Frequent Itemset Mining Dataset Repository. The Extended Bakery dataset is available here, and an excellent analysis is available here.
