Text clustering using Mahout command line–Step-by-Step

Here is a step-by-step guide to text clustering from the Mahout command line. I have the Cloudera Hadoop ecosystem installed on a single machine (CentOS 6.5). Mahout was installed afterwards with a single command:

# yum install mahout

The installed Mahout is version 0.7. You can check this from the versioned file names listed by:

$ rpm -ql mahout

For this experiment, copy a few Wikipedia articles into your favourite text editor and save them as text files in a folder. I copied articles related to Finance and Quantum Mechanics into the folder repo/mytext/. The following shell script converts the text files to vectors, clusters them and finally prints two clusters along with the list of files in each. I use k-means clustering.
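Before running the script, it may be worth confirming that the text folder really contains files, since an empty folder gives Mahout nothing to cluster. Here is a minimal sketch of such a check; the helper name count_text_files and the throwaway demo folder are my own assumptions and are not part of Mahout or of the script.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): count the regular files
# directly inside a folder, so an empty input folder is caught early.
count_text_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Demo on a throwaway folder standing in for repo/mytext
demoFolder=$(mktemp -d)
printf 'finance text\n' > "$demoFolder/Finance"
printf 'quantum text\n' > "$demoFolder/Quantum mechanics"
nFiles=$(count_text_files "$demoFolder")
echo "Found $nFiles file(s)"
rm -rf "$demoFolder"
```

In practice you would call count_text_files on your own text folder and stop if it reports 0. The full script follows.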

############Shell Script########################
#!/bin/sh
cd ~
# What is my home folder
myhome=`pwd`
# This file stores analysis results
resultFile="$myhome/repo/result.txt"
# Folder containing Wikipedia articles
textFolder="$myhome/repo/mytext"
# Folder on hadoop wherein other folders
#   will be created subsequently
hdfsFolder="/user/ashokharnal"

# Step 1: Mahout will need to know path to hadoop jar files.
#   In Cloudera installed on CentOS, the path is as below:

export HADOOP_CLASSPATH="/usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*:$HADOOP_CLASSPATH"

# Step 2: Now copy your Wiki text folder to the hdfs folder.
#  The folder /user/ashokharnal/mytext is created:

hdfs dfs -put  $textFolder/  $hdfsFolder/

# Step 3: Convert text files in the hadoop folder to sequence format:

mahout seqdirectory  \
-i  $hdfsFolder/mytext \
-o  $hdfsFolder/mytext-seq  \
-c  UTF-8 \
-chunk 5

# Step 4: Convert sequence format to sparse vector format
#    Output stored in mytext-vectors
#    Flag -nv also keeps file names

mahout seq2sparse \
-nv   -i  $hdfsFolder/mytext-seq/  \
-o   $hdfsFolder/mytext-vectors

# Step 5: Now create two k-means clusters:

mahout kmeans -i $hdfsFolder/mytext-vectors/tfidf-vectors/  \
-c $hdfsFolder/mytext-kmeansSeed  \
-o $hdfsFolder/mytext-clusters   \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure  \
--clustering  -cl  -cd  0.1  -x  10  -k  2  -ow

# Note: the --clustering flag above starts with two consecutive
#   dashes, with no space between them.

# Step 6: Print output to a text file on Linux file system:

mahout clusterdump  -i  $hdfsFolder/mytext-clusters/clusters-0   \
-o  $resultFile   \
-d  $hdfsFolder/mytext-vectors/dictionary.file-0 \
-b  100   \
-p  $hdfsFolder/mytext-clusters/clusteredPoints    \
-dt  sequencefile   -n  20

cat $resultFile

# Step 7: Print file names vs clusters using mahout seqdumper
#   In the result (tmp.txt), replace anything between [  ] with
#     spaces and sort the results in order of key value:

mahout seqdumper  \
-i $hdfsFolder/mytext-clusters/clusteredPoints/part-m-00000 > tmp.txt

# Use sed, stream editor, to parse the above output

sed -e "s/= \[.*\]/ /" tmp.txt  |  sort
rm -r -f tmp.txt
############End of shell script############
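To see what the sed command in Step 7 does without running Mahout, you can try it on a couple of sample lines. This is only a sketch: the lines are shaped the way the sed pattern expects, and the numbers inside the square brackets are made up for illustration.

```shell
#!/bin/sh
# Two made-up lines in the shape Step 7 expects from seqdumper.
# The bracketed index:weight pairs are invented for illustration.
sample='Key: 7: Value: 1.0: /Quantum Circuit = [12:0.41, 97:0.08]
Key: 0: Value: 1.0: /Finance = [3:0.27, 55:0.19]'

# Replace everything from "= [" up to the closing "]" with a space,
# then sort so that lines with the same key end up together.
cleaned=$(printf '%s\n' "$sample" | sed -e 's/= \[.*\]/ /' | sort)
printf '%s\n' "$cleaned"
```

Only the key, the weight and the file name survive, which is what makes the final listing readable.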

The output of the sed command will look something like:

Input Path: /user/ashokharnal/mytext-clusters/clusteredPoints/part-m-00000
Key: 0: Value: 1.0: /Corporate Finance
Key: 0: Value: 1.0: /Finance
Key: 0: Value: 1.0: /Financial Capitalism
Key: 0: Value: 1.0: /Financial Markets
Key: 0: Value: 1.0: /Personal Finance
Key: 7: Value: 1.0: /Introduction to Quantum mechanics
Key: 7: Value: 1.0: /Quantum Circuit
Key: 7: Value: 1.0: /Quantum Computer
Key: 7: Value: 1.0: /Quantum mechanics

Two file clusters have been created, one labelled Key=0 and the other Key=7.
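If you have many files, a per-cluster count is easier to read than the full listing. A minimal sketch using awk follows; it works on the sample lines shown above, and the variable names are my own.

```shell
#!/bin/sh
# Sample lines copied from the output listed above
clusterLines='Key: 0: Value: 1.0: /Corporate Finance
Key: 0: Value: 1.0: /Finance
Key: 0: Value: 1.0: /Financial Capitalism
Key: 0: Value: 1.0: /Financial Markets
Key: 0: Value: 1.0: /Personal Finance
Key: 7: Value: 1.0: /Introduction to Quantum mechanics
Key: 7: Value: 1.0: /Quantum Circuit
Key: 7: Value: 1.0: /Quantum Computer
Key: 7: Value: 1.0: /Quantum mechanics'

# Splitting on ": " makes field 2 the cluster key; count lines per key
summary=$(printf '%s\n' "$clusterLines" |
    awk -F': ' '{ n[$2]++ } END { for (k in n) print k, n[k] }' |
    sort)
printf '%s\n' "$summary"
```

Each output line is a cluster key followed by the number of files assigned to it. On real output you would pipe the result of the sed command into the same awk program, after dropping the "Input Path" line.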

For the theory behind text clustering, explained in a simple way, refer to this excellent three-part tutorial.
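As a small taste of that theory, Step 5 asks k-means to compare documents by cosine distance, that is, one minus the cosine of the angle between two term-weight vectors. The sketch below computes it with awk for toy vectors; the function name cosine_distance and the vector values are my own assumptions, and this illustrates the formula only, not Mahout's own code.

```shell
#!/bin/sh
# Cosine distance = 1 - (a . b) / (|a| * |b|)
# Each argument is a vector given as space-separated weights.
# Assumes both vectors have the same length and neither is all zeros.
cosine_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, " ")
        split(b, y, " ")
        for (i = 1; i <= n; i++) {
            dot += x[i] * y[i]
            na  += x[i] * x[i]
            nb  += y[i] * y[i]
        }
        printf "%.3f\n", 1 - dot / (sqrt(na) * sqrt(nb))
    }'
}

cosine_distance "3 4" "6 8"   # same direction: distance is zero
cosine_distance "1 0" "0 1"   # no terms in common: distance is one
```

Documents that use the same words in similar proportions get a distance near zero even if one is much longer than the other, which is why this measure suits text.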
