Experimenting with Random Projections—A Non-maths approach

April 1, 2015

Random Projections is a dimensionality reduction technique that is very simple to implement, yet the mathematics behind it is anything but simple. The Johnson-Lindenstrauss lemma, which forms the basis of this technique, is stated in notation that can be hard to follow even for a person with a fairly deep background in mathematics. In this blog, I will experiment with the results that follow from this lemma. The results themselves are easy to understand and to experiment with, and such experimentation provides a lot of insight into what happens behind the scenes and how the dimensionality reduction works.
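
In words, the lemma says that any set of r points can be squeezed into roughly k = O(log r / eps^2) dimensions while changing every pairwise Euclidean distance by no more than a factor of (1 ± eps). To get a feel for the numbers, here is a minimal sketch (my addition; it uses a ready-made scikit-learn helper and is not part of the code in this post) that evaluates the bound for a sample size close to the training split used below:

# How many dimensions does the Johnson-Lindenstrauss bound ask for?
# (Purely for orientation; the experiments below use far fewer dimensions.)
from sklearn.random_projection import johnson_lindenstrauss_min_dim

n_samples = 29400                      # roughly the size of our training split below
for eps in (0.1, 0.3, 0.5):            # allowed relative distortion of distances
    k = johnson_lindenstrauss_min_dim(n_samples, eps=eps)
    print("eps = %.1f  ->  minimum k = %d" % (eps, k))

The bound is a worst-case guarantee and is quite conservative; as the experiments below show, a few hundred dimensions are already enough for good classification accuracy on this data.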

Dimensionality reduction is achieved in this manner: for a feature matrix A of dimensions r X n, a projection matrix P of dimensions n X k is created (where k << n). Matrix multiplication of A with P yields the transformed feature matrix S of dimensions r X k. A has r rows, one per object, and hence represents r points in n-dimensional space; S contains the same r objects, but as points in k-dimensional space. The key property is that relative Euclidean distances between the points of A are (approximately) preserved between the corresponding points of S. In the discussion that follows, we call P the projection matrix.
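
As a quick sanity check of this claim, here is a minimal sketch (made-up data, my addition, not part of the MNIST experiment below) that projects a handful of random points and compares a few pairwise distances before and after projection. The 1/sqrt(k) scaling keeps the absolute distances comparable; the code later in this post omits it, which does not matter for k-NN since only relative distances count.

import numpy as np

np.random.seed(0)
r, n, k = 50, 784, 200                              # 50 points in 784 dimensions, projected down to 200

A = np.random.rand(r, n)                            # made-up feature matrix (r points)
P = np.random.standard_normal((n, k)) / np.sqrt(k)  # Gaussian projection matrix, scaled by 1/sqrt(k)
S = A.dot(P)                                        # projected points, shape (r, k)

# Compare a few pairwise Euclidean distances before and after projection
for i, j in [(0, 1), (2, 3), (4, 5)]:
    d_original  = np.linalg.norm(A[i] - A[j])
    d_projected = np.linalg.norm(S[i] - S[j])
    print(i, j, round(d_original, 3), round(d_projected, 3))

For each pair the two distances come out roughly equal, which is all the experiments below rely upon.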

We will work with the MNIST handwritten digits set. You can download it from here. Our tasks are: a) create a projection matrix for this data set, b) use the projection matrix to transform the training digit-dataset to lower dimensions, c) train a k-NN classifier on this reduced-dimension (training) digit-set, d) transform the test digit-dataset to lower dimensions using the same projection matrix, and e) make predictions and determine the prediction accuracy on the transformed test-set. We will do this for projection matrices constructed in three different ways, and we will also vary the number of projected dimensions.

Constructing Projection matrices

There are three ways a projection matrix P of dimension n X k can be constructed, and all three are independent of the data in the feature matrix (i.e. the MNIST data set in our case). The first method is to create an array of n X k random numbers drawn from the N(0,1) distribution (i.e. the standard normal distribution with mean 0 and standard deviation 1) and reshape it into a matrix of size n X k. The other two ways come from the paper by Dimitris Achlioptas, ‘Database-friendly random projections: Johnson-Lindenstrauss with binary coins‘. The second way is to fill the projection matrix with just two numbers, -1 or +1, each selected with equal probability 0.5. The third way is to fill it with one of three numbers, sqrt(3), 0 or -sqrt(3), with probabilities 1/6, 2/3 and 1/6 respectively. We are going to try all three methods. While we will use python for training the k-NN classifier on the MNIST dataset, we will not use any pre-built sklearn class to create our projection matrix; we will do the work from scratch (so to say). For example, the following three-line code creates a projection matrix of dimension (11000, 45), i.e. one that would project 11000-dimensional data down to 45 dimensions, filled with N(0,1) random numbers.

rn=np.random.standard_normal(11000*45)
rn=rn.reshape(11000,45)
transformation_matrix = np.asmatrix(rn)
  

Experimenting with projection matrix filled with N(0,1) numbers

As random projections preserve the relative Euclidean distances between points, it is appropriate to use a classifier that bases its decisions on Euclidean distance as a proximity measure; hence our choice of the k-Nearest Neighbour classifier. The training data has 785 columns: one for the label and 784 for pixel intensity values on a gray scale, since each handwritten image has 28 X 28 = 784 pixels. As is well known, there is a high degree of correlation between adjacent pixel intensities, and hence there is large scope for dimensionality reduction. We will vary the number of projected dimensions from 20 to 500 and compare the classification accuracy against that obtained when all 784 columns are used. The following python code does this. The code is liberally commented.


import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
%pylab

# Change directory to data folder
%cd /home/ashokharnal/

# Upload MNIST database
# Header is at row 0
digits = pd.read_csv("/home/ashokharnal/train.csv", header=0, sep=',')	
# Know its dimesions
digits.shape
# Separate target and predictors
y=digits['label']	# Target
X=digits.iloc[:,1:]	# Predictors

# Convert X to float for scaling it (it is a requirement of scaling class)
Z=X.astype('float64')

# Pre-process data (pixel values). Scale it between 0 and 1
min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(Z)
# X_train is numpy array. Convert it to data frame (though this conversion is not needed)
F=pd.DataFrame(X_train)
# Reassign it column names
F.columns=X.columns
# Split predictors and targets in ratio of 70:30 
split = train_test_split(F,y, test_size = 0.3,random_state = 42)
# split returns four arrays
(train_X, test_X, train_y, test_y) = split

print("Building k-nn model with all columns")
clf = neighbors.KNeighborsClassifier()
clf.fit(train_X,train_y)
Z = clf.predict(test_X)
max_accuracy=clf.score(test_X,test_y)
print("Model max possible accuracy : "+ str(max_accuracy))

# Generate 'num' integer values between 20 and 500.
num=20
projected_dimensions = np.int32(np.linspace(20, 500, num))

# An empty list to store accuracy scores
accuracy_list = []
# Begin iteration per projection
for r_projections in projected_dimensions:
    print("Next iteration:-")
    print("    No of columns to be projected: "+str(r_projections))

    # Create an array of N(0,1) values
    rn=np.random.standard_normal(size=(digits.shape[1] - 1)*r_projections)
    rn=rn.reshape((digits.shape[1] - 1),r_projections)
    transformation_matrix=np.asmatrix(rn)
    data=np.asmatrix(train_X)

    # Multiply data matrix with projection matrix to reduce dimensions
    # X is the transformed matrix
    X = data * transformation_matrix
    print("    Training digit data transformed to dimension: " + str(X.shape))

    # Train a classifier on the transformed matrix
    print("    Creating model")
    model = neighbors.KNeighborsClassifier()
    model.fit(X,train_y)

    # Evaluate the model on the test set
    # First reduce dimensions of test data using the same transformation
    test_X=np.asmatrix(test_X)
    test = test_X * transformation_matrix

    # Make predictions and check the score
    acc=model.score(test,test_y)
    print("    Accuracy achieved: "+str(acc))
    accuracy_list.append(acc)
    print("------------")

# create graph of accuracies achieved
plt.figure()
plt.suptitle("Accuracy of Random Projections on MNIST handwritten digits")
plt.xlabel("No of projected dimensions")
plt.ylabel("Accuracy")
plt.xlim([2, 500])
plt.ylim([0, 1.0])
 
# plot the maximum achievable accuracy against random projection accuracies
plt.plot(projected_dimensions, [max_accuracy] * len(accuracy_list), color = "r")
plt.plot(projected_dimensions, accuracy_list)
plt.show()

Accuracy results can be seen in the following graph. Even for a dimension as low as 20, the prediction accuracy is remarkable.

Random projections as per Gaussian distribution. Red line is max possible accuracy with all 784 columns considered.

It can be observed that with a little more than 200 dimensions we achieve as much accuracy as with all 784 dimensions. The accuracy recorded at each iteration is given below.


Projection matrix with N(0,1) numbers
=======================================

Model max possible accuracy : 0.965317460317

First iteration:-
No of columns  projected: 20
Training digit-data transformed to dimension: (29400, 20)
Accuracy achieved: 0.837063492063
------------
Next iteration:-
No of columns  projected: 45
Training digit-data transformed to dimension: (29400, 45)
Accuracy achieved: 0.923333333333
------------
Next iteration:-
No of columns  projected: 70
Training digit-data transformed to dimension: (29400, 70)
Accuracy achieved: 0.94380952381
------------
Next iteration:-
No of columns  projected: 95
Training digit-data transformed to dimension: (29400, 95)
Accuracy achieved: 0.952063492063
------------
Next iteration:-
No of columns  projected: 121
Training digit-data transformed to dimension: (29400, 121)
Accuracy achieved: 0.955238095238
------------
Next iteration:-
No of columns  projected: 146
Training digit-data transformed to dimension: (29400, 146)
Accuracy achieved: 0.955793650794
------------
Next iteration:-
No of columns  projected: 171
Training digit-data transformed to dimension: (29400, 171)
Accuracy achieved: 0.95626984127
------------
Next iteration:-
No of columns  projected: 196
Training digit-data transformed to dimension: (29400, 196)
Accuracy achieved: 0.956984126984
------------
Next iteration:-
No of columns  projected: 222
Training digit-data transformed to dimension: (29400, 222)
Accuracy achieved: 0.959285714286
------------
Next iteration:-
No of columns  projected: 247
Training digit-data transformed to dimension: (29400, 247)
Accuracy achieved: 0.961031746032
------------
Next iteration:-
No of columns  projected: 272
Training digit-data transformed to dimension: (29400, 272)
Accuracy achieved: 0.960476190476
------------
Next iteration:-
No of columns  projected: 297
Training digit-data transformed to dimension: (29400, 297)
Accuracy achieved: 0.960793650794
------------
Next iteration:-
No of columns  projected: 323
Training digit-data transformed to dimension: (29400, 323)
Accuracy achieved: 0.960952380952
------------
Next iteration:-
No of columns  projected: 348
Training digit-data transformed to dimension: (29400, 348)
Accuracy achieved: 0.960793650794
------------
Next iteration:-
No of columns  projected: 373
Training digit-data transformed to dimension: (29400, 373)
Accuracy achieved: 0.962380952381
------------
Next iteration:-
No of columns  projected: 398
Training digit-data transformed to dimension: (29400, 398)
Accuracy achieved: 0.960714285714
------------
Next iteration:-
No of columns  projected: 424
Training digit-data transformed to dimension: (29400, 424)
Accuracy achieved: 0.961825396825
------------
Next iteration:-
No of columns  projected: 449
Training digit-data transformed to dimension: (29400, 449)
Accuracy achieved: 0.961984126984
------------
Next iteration:-
No of columns  projected: 474
Training digit-data transformed to dimension: (29400, 474)
Accuracy achieved: 0.963015873016
------------
Next iteration:-
No of columns  projected: 500
Training digit-data transformed to dimension: (29400, 500)
Accuracy achieved: 0.963492063492
------------

Experimenting with projection matrix of binary numbers

We now fill the projection matrix with just two values, -1 and +1, each with probability 0.5. The following few lines of python code create such a projection matrix. For an explanation of how this code works, you may refer here.


def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.0, -1.0])
probabilities = np.array([0.5,0.5])
#  1.0 occurs with p=0.5 (1st value) and -1.0 occurs with p=0.5 (2nd value)
rn=weighted_values(values, probabilities, (11000*45))
rn=rn.reshape(11000,45)
transformation_matrix=np.asmatrix(rn)
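
As an aside (my addition, not part of the original code), numpy's random.choice can generate the same +1/-1 matrix directly from a probability vector, without the helper function:

# Alternative: draw the +1/-1 entries directly with np.random.choice
import numpy as np

values = np.array([1.0, -1.0])
rn = np.random.choice(values, size=11000*45, p=[0.5, 0.5])
rn = rn.reshape(11000, 45)
transformation_matrix = np.asmatrix(rn)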
  

The k-NN training and modelling code, on lines similar to the earlier code, is below. I have removed most comments.

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
%pylab

%cd /home/ashokharnal/Documents/
digits = pd.read_csv("/home/ashokharnal/Documents/train.csv", header=0, sep=',')	

# Separate target and predictors
y=digits['label']	# Target
X=digits.iloc[:,1:]	# Predictors

# Convert X to float for scaling it (it is a requirement of scaling)
Z=X.astype('float64')

# Preprocess data. Scale it between 0 and 1
min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(Z)

# Convert it to data frame 
F=pd.DataFrame(X_train)
F.columns=X.columns
# Split predictors and targets in ratio of 70:30 
split = train_test_split(F,y, test_size = 0.3,random_state = 42)
(train_X, test_X, train_y, test_y) = split

print("Building k-nn model with all columns")
clf = neighbors.KNeighborsClassifier()
clf.fit(train_X,train_y)
Z = clf.predict(test_X)
max_accuracy=clf.score(test_X,test_y)
print("Model max possible accuracy : "+ str(max_accuracy))

# Function to generate discrete random variables with specified weights
def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.0, -1.0])
probabilities = np.array([0.5,0.5])

# Get 'num' integer values between 20 and 500.
num=20
projected_dimensions = np.int32(np.linspace(20, 500, num))
accuracy_list = []

for r_projections in projected_dimensions:
	print("Next iteration:-")
	print("    No of columns to be projected: "+str(r_projections))
	# Create an array of binary values
	rn=weighted_values(values, probabilities, (digits.shape[1] - 1)*r_projections)
	rn=rn.reshape((digits.shape[1] - 1),r_projections)
	transformation_matrix=np.asmatrix(rn)
	data=np.asmatrix(train_X)
	X = data * transformation_matrix
	print("    Training digit data transformed to dimension: " + str(X.shape))

	# Train a classifier on the random projection
	print("    Creating model")
	model = neighbors.KNeighborsClassifier()
	model.fit(X,train_y)
	
	test_X=np.asmatrix(test_X)
	test = test_X * transformation_matrix
	acc=model.score(test,test_y)
	print("    Accuracy achieved: "+str(acc) )
	accuracy_list.append(acc)
	print("------------")    

# Create accuracy graph
plt.figure()
plt.suptitle("Accuracy of Random Projections on MNIST handwritten digits")
plt.xlabel("No of projected dimensions")
plt.ylabel("Accuracy")
plt.xlim([2, 500])
plt.ylim([0, 1.0])
 
# plot the maximum achievable accuracy against random projection accuracies
plt.plot(projected_dimensions, [max_accuracy] * len(accuracy_list), color = "r")
plt.plot(projected_dimensions, accuracy_list)
plt.show()

The following graph displays the accuracy results.

Projection matrix with binary values. Red line is max possible accuracy with all 784 columns considered.

Again, as can be seen, even with such a simple projection matrix around 200 dimensions give nearly the maximum possible accuracy. An account of accuracy achieved vs. projected columns is given below. You will note that even 45 dimensions provide a prediction accuracy of around 93%.

Using +1 and -1 as matrix elements
---------------------------------

Model max possible accuracy : 0.965317460317

First iteration:-
No of columns  projected: 20
Training digit-data transformed to dimension: (29400, 20)
Accuracy achieved: 0.842698412698
------------
Next iteration:-
No of columns  projected: 45
Training digit-data transformed to dimension: (29400, 45)
Accuracy achieved: 0.928492063492
------------
Next iteration:-
No of columns  projected: 70
Training digit-data transformed to dimension: (29400, 70)
Accuracy achieved: 0.948253968254
------------
Next iteration:-
No of columns  projected: 95
Training digit-data transformed to dimension: (29400, 95)
Accuracy achieved: 0.949682539683
------------
Next iteration:-
No of columns  projected: 121
Training digit-data transformed to dimension: (29400, 121)
Accuracy achieved: 0.957857142857
------------
Next iteration:-
No of columns  projected: 146
Training digit-data transformed to dimension: (29400, 146)
Accuracy achieved: 0.956507936508
------------
Next iteration:-
No of columns  projected: 171
Training digit-data transformed to dimension: (29400, 171)
Accuracy achieved: 0.955714285714
------------
Next iteration:-
No of columns  projected: 196
Training digit-data transformed to dimension: (29400, 196)
Accuracy achieved: 0.959126984127
------------
Next iteration:-
No of columns  projected: 222
Training digit-data transformed to dimension: (29400, 222)
Accuracy achieved: 0.958095238095
------------
Next iteration:-
No of columns  projected: 247
Training digit-data transformed to dimension: (29400, 247)

Accuracy achieved: 0.960238095238
------------
Next iteration:-
No of columns  projected: 272
Training digit-data transformed to dimension: (29400, 272)
Accuracy achieved: 0.961031746032
------------
Next iteration:-
No of columns  projected: 297
Training digit-data transformed to dimension: (29400, 297)
Accuracy achieved: 0.960317460317
------------
Next iteration:-
No of columns  projected: 323
Training digit-data transformed to dimension: (29400, 323)
Accuracy achieved: 0.960476190476
------------
Next iteration:-
No of columns  projected: 348
Training digit-data transformed to dimension: (29400, 348)
Accuracy achieved: 0.960873015873
------------
Next iteration:-
No of columns  projected: 373
Training digit-data transformed to dimension: (29400, 373)
Accuracy achieved: 0.961666666667
------------
Next iteration:-
No of columns  projected: 398
Training digit-data transformed to dimension: (29400, 398)
Accuracy achieved: 0.96253968254
------------
Next iteration:-
No of columns  projected: 424
Training digit-data transformed to dimension: (29400, 424)
Accuracy achieved: 0.96126984127
------------
Next iteration:-
No of columns  projected: 449
Training digit-data transformed to dimension: (29400, 449)
Accuracy achieved: 0.960238095238
------------
Next iteration:-
No of columns  projected: 474
Training digit-data transformed to dimension: (29400, 474)
Accuracy achieved: 0.962222222222
------------
Next iteration:-
No of columns  projected: 500
Training digit-data transformed to dimension: (29400, 500)
Accuracy achieved: 0.962619047619
------------

Experimenting with projection matrix–Another probability distribution

We will now construct the projection matrix using another simple probability distribution. This distribution has three possible values: sqrt(3), 0 and -sqrt(3), i.e. 1.7320508075, 0.0 and -1.7320508075, with probabilities 1/6, 2/3 and 1/6 respectively. As earlier, the following python code will populate a matrix with this distribution:

def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.7320508075, 0.0, -1.7320508075])
probabilities = np.array([1.0/6.0, 2.0/3.0, 1.0/6.0])
rn=weighted_values(values, probabilities, (11000*45))
rn=rn.reshape(11000,45)
transformation_matrix=np.asmatrix(rn)
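
It is worth noting why such a sparse-looking distribution works at all: its mean is 0 and its variance is 3*(1/6) + 0*(2/3) + 3*(1/6) = 1, exactly like N(0,1). A quick empirical check (my addition, purely for illustration):

# Empirical check: the sqrt(3)/0/-sqrt(3) distribution has mean ~0 and variance ~1
import numpy as np

values = np.array([1.7320508075, 0.0, -1.7320508075])
sample = np.random.choice(values, size=1000000, p=[1.0/6.0, 2.0/3.0, 1.0/6.0])
print("mean    :", sample.mean())    # close to 0
print("variance:", sample.var())     # close to 1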
  

The python code for k-Nearest Neighbour modelling, using this transformation matrix to reduce dimensions, is given below. We have removed most of the comments.


import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
%pylab

%cd /home/ashokharnal/Documents/
digits = pd.read_csv("/home/ashokharnal/Documents/train.csv", header=0, sep=',')	
y=digits['label']	# Target
X=digits.iloc[:,1:]	# Predictors
Z=X.astype('float64')

min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(Z)
F=pd.DataFrame(X_train)
F.columns=X.columns
split = train_test_split(F,y, test_size = 0.3,random_state = 42)
(train_X, test_X, train_y, test_y) = split

print("Building k-nn model with all columns")
clf = neighbors.KNeighborsClassifier()
clf.fit(train_X,train_y)
Z = clf.predict(test_X)
max_accuracy=clf.score(test_X,test_y)

print("Model max possible accuracy : "+ str(max_accuracy))

def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.7320508075, 0.0, -1.7320508075])
probabilities = np.array([1.0/6.0, 2.0/3.0, 1.0/6.0])

num=20
projected_dimensions = np.int32(np.linspace(20, 500, num))
accuracy_list = []

for r_projections in projected_dimensions:
	print("Next iteration:-")
	print("   No of columns to be projected: "+str(r_projections))
	# Create an array of sqrt(3)/0/-sqrt(3) values
	rn=weighted_values(values, probabilities, (digits.shape[1] - 1)*r_projections)
	rn=rn.reshape((digits.shape[1] - 1),r_projections)
	transformation_matrix=np.asmatrix(rn)
	data=np.asmatrix(train_X)
	X = data * transformation_matrix
	print("   Training digit-data transformed to dimension: " + str(X.shape))

	# train a classifier on the random projection
	print("   Creating model")
	model = neighbors.KNeighborsClassifier()
	model.fit(X,train_y)
	# evaluate the model and update accuracy_list
	test_X=np.asmatrix(test_X)
	test = test_X * transformation_matrix
	acc=model.score(test,test_y)
	print("   Accuracy achieved: "+str(acc) )
	accuracy_list.append(acc)
	print("------------")    

plt.figure()
plt.suptitle("Accuracy of Random Projections on MNIST handwritten digits")
plt.xlabel("No of projected dimensions")
plt.ylabel("Accuracy")
plt.xlim([2, 500])
plt.ylim([0, 1.0])
 
plt.plot(projected_dimensions, [max_accuracy] * len(accuracy_list), color = "r")
plt.plot(projected_dimensions, accuracy_list)
plt.show()

The graph of accuracy score vs. projected columns is below.

Projection matrix with three possible values. Red line is max possible accuracy with all 784 columns considered.

Again, as before, the prediction accuracy rises as the number of columns increases, and it reaches close to the maximum achievable when the number of columns is a little more than 200. An account of accuracy achieved vs. projected columns in this case is given below. As before, even with as few as 45 dimensions, prediction accuracy is around 93%.


Using sqrt(3), 0 and -sqrt(3) as matrix elements
---------------------------------
First iteration:-
No of columns  projected: 20
Training digit-data transformed to dimension: (29400, 20)
Accuracy achieved: 0.829523809524
------------
Next iteration:-
No of columns  projected: 45
Training digit-data transformed to dimension: (29400, 45)
Accuracy achieved: 0.928015873016
------------
Next iteration:-
No of columns  projected: 70
Training digit-data transformed to dimension: (29400, 70)
Accuracy achieved: 0.94253968254
------------
Next iteration:-
No of columns  projected: 95
Training digit-data transformed to dimension: (29400, 95)
Accuracy achieved: 0.950317460317
------------
Next iteration:-
No of columns  projected: 121
Training digit-data transformed to dimension: (29400, 121)
Accuracy achieved: 0.951825396825
------------
Next iteration:-
No of columns  projected: 146
Training digit-data transformed to dimension: (29400, 146)
Accuracy achieved: 0.954285714286
------------
Next iteration:-
No of columns  projected: 171
Training digit-data transformed to dimension: (29400, 171)
Accuracy achieved: 0.959206349206
------------
Next iteration:-
No of columns  projected: 196
Training digit-data transformed to dimension: (29400, 196)
Accuracy achieved: 0.958095238095
------------
Next iteration:-
No of columns  projected: 222
Training digit-data transformed to dimension: (29400, 222)
Accuracy achieved: 0.958650793651
------------
Next iteration:-
No of columns  projected: 247
Training digit-data transformed to dimension: (29400, 247)
Accuracy achieved: 0.960634920635
------------
Next iteration:-
No of columns  projected: 272
Training digit-data transformed to dimension: (29400, 272)
Accuracy achieved: 0.96
------------
Next iteration:-
No of columns  projected: 297
Training digit-data transformed to dimension: (29400, 297)
Accuracy achieved: 0.960158730159
------------
Next iteration:-
No of columns  projected: 323
Training digit-data transformed to dimension: (29400, 323)
Accuracy achieved: 0.959603174603
------------
Next iteration:-
No of columns  projected: 348
Training digit-data transformed to dimension: (29400, 348)
Accuracy achieved: 0.961507936508
------------
Next iteration:-
No of columns  projected: 373
Training digit-data transformed to dimension: (29400, 373)
Accuracy achieved: 0.962222222222
------------
Next iteration:-
No of columns  projected: 398
Training digit-data transformed to dimension: (29400, 398)
Accuracy achieved: 0.963253968254
------------
Next iteration:-
No of columns  projected: 424
Training digit-data transformed to dimension: (29400, 424)
Accuracy achieved: 0.961507936508
------------
Next iteration:-
No of columns  projected: 449
Training digit-data transformed to dimension: (29400, 449)
Accuracy achieved: 0.963888888889
------------
Next iteration:-
No of columns  projected: 474
Training digit-data transformed to dimension: (29400, 474)
Accuracy achieved: 0.963095238095
------------
Next iteration:-
No of columns  projected: 500
Training digit-data transformed to dimension: (29400, 500)
Accuracy achieved: 0.963888888889
------------

This finishes our experiments in random projections.
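
For completeness (my addition): scikit-learn also ships ready-made transformers for exactly these constructions, GaussianRandomProjection and SparseRandomProjection, so if you do not want to build the matrix yourself the whole pipeline fits in a few lines. A minimal sketch, assuming train_X, train_y, test_X and test_y from the scripts above:

# Sketch: the same experiment with sklearn's built-in random projection transformers
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection
from sklearn import neighbors

for Projection in (GaussianRandomProjection, SparseRandomProjection):
    rp = Projection(n_components=200)          # project down to 200 dimensions
    X_small = rp.fit_transform(train_X)        # fit() draws the random matrix
    test_small = rp.transform(test_X)          # reuse the same matrix on the test set
    model = neighbors.KNeighborsClassifier()
    model.fit(X_small, train_y)
    print(Projection.__name__, model.score(test_small, test_y))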


Facial Keypoints detection–Using R, deep learning and H2O

March 22, 2015

Computer vision is a hotly pursued subject. While web cameras on computers are good at taking pictures, they have no intelligence, none at all, at recognizing whether two images are the same. Computer vision pushes machine learning to its limits. Deep learning neural networks composed of many hidden layers are used on powerful machines to detect, for example, whether two images of faces are of the same person. Facebook, with its DeepFace technology, has achieved near human-level capability in recognizing faces.

Kaggle is hosting a competition in recognizing faces. The methodology is this: discover the locations of key points on a face, such as the x-coordinate of the left-eye centre, the y-coordinate of the left-eye centre, and so on. There are 30 such coordinate values (15 key points, each with an x and a y coordinate). When the locations of these key points in two images of faces match, the two images are presumed to be of the same face.

In this exercise, I have used the deep learning algorithm of H2O. H2O is very easy to install, either on a stand-alone basis or within R, and its deep learning algorithm can be invoked from within R. A limitation of H2O deep learning is that there can be only one response variable, but here we have 30. So we do the modelling and prediction one response at a time instead of all 30 in one go. You can loop the modelling-and-prediction sequence over a number of location points, depending upon the capacity of your machine. My machine has 16GB of RAM; I made predictions for 5 location-points in one for-loop run, then for another 5, and so on for all 30. I was able to achieve a Kaggle score of 3.55854. Even if you have an 8GB machine, do not be disappointed; you can still get a respectable score.

Installing H2O in R on CentOS

Installation in R is easy. First use yum to install R-devel (yum install R-devel) and then libcurl-devel.x86_64 (yum install libcurl-devel.x86_64). Next, install the dependencies mentioned here. My version of R is 3.1.1. Many dependencies such as ‘methods’, ‘tools’, ‘utils’ and ‘stats’ already come bundled with R and are pre-installed. Then download H2O, unzip it and install its R package from within R as mentioned on this page. The installation process should not take more than 20 minutes.

Data structure and transformation

You must have downloaded the training and test files from Kaggle. The training file has 7049 images of faces; the test file has 1783 images. The first two lines of training.csv are shown below:

bash-4.1$ head -2 training.csv 
left_eye_center_x,left_eye_center_y,right_eye_center_x,right_eye_center_y,left_eye_inner_corner_x,left_eye_inner_corner_y,left_eye_outer_corner_x,left_eye_outer_corner_y,right_eye_inner_corner_x,right_eye_inner_corner_y,right_eye_outer_corner_x,right_eye_outer_corner_y,left_eyebrow_inner_end_x,left_eyebrow_inner_end_y,left_eyebrow_outer_end_x,left_eyebrow_outer_end_y,right_eyebrow_inner_end_x,right_eyebrow_inner_end_y,right_eyebrow_outer_end_x,right_eyebrow_outer_end_y,nose_tip_x,nose_tip_y,mouth_left_corner_x,mouth_left_corner_y,mouth_right_corner_x,mouth_right_corner_y,mouth_center_top_lip_x,mouth_center_top_lip_y,mouth_center_bottom_lip_x,mouth_center_bottom_lip_y,Image
66.0335639098,39.0022736842,30.2270075188,36.4216781955,59.582075188,39.6474225564,73.1303458647,39.9699969925,36.3565714286,37.3894015038,23.4528721805,37.3894015038,56.9532631579,29.0336481203,80.2271278195,32.2281383459,40.2276090226,29.0023218045,16.3563789474,29.6474706767,44.4205714286,57.0668030075,61.1953082707,79.9701654135,28.6144962406,77.3889924812,43.3126015038,72.9354586466,43.1307067669,84.4857744361,238 236 237 238 240 240 239 241 241 243 240 239 231 212 190 173 148 122 104 92 79 73 74 73 73 74 81 74 60 64 75 86 93 102 100 105 109 114 121 127 132 134 137 137 140 139 138 137 137 140 141 143 144 147 148 149 147 147 148 145 147 144 146 147 147 143 134 130 130 128 116 104 98 90 82 78 85 88 86 80 77 87 108 111 115 128 133 188 242 252 250 248 251 250 250 250 235 238 236 238 238 237 238 242 241 239 237 233 215 195 187 156 119 103 93 78 68 73 75 75 72 75 70 61 66 77 91 96 106 108 113 120 125 131 134 138 135 138 139 145 144 144 142 140 141 141 148 147 150 149 152 151 149 150 147 148 144 148 144 146 146 143 139 128 132 135 128 112 104 97 87 78 79 83 85 83 75 75 89 109 111 117 117 130 194 243 251 249 250 249 250 251 237 236 237 238 237 238 241 238 238 238 241 221 195 187 163 124 106 95 81 68 70 73 73 72 73 69 65 74 82 94 103 110 111 119 127 135 140 139 

Each data line has 30 facial key-point coordinates followed by 96 X 96 = 9216 pixel-intensity values (we have truncated the trailing pixel values in the second row above for brevity). The first line is the header line with the names of the fields; there are 31 names: 30 location-point names and one name for the ‘Image’ field. Thus all the pixel intensity values sit in a single field, separated by spaces. The data structure when read into R is as follows: all location points are numeric, and ‘Image’, with all the pixel values, is one huge character field.


setwd("/home/ganesh/facial_key_points/")
# Read training file. Note that all fields are comma separated
#   but each Image has 96*96 = 9216 values, space separated
#    to avoid Image being treated as factor, set stringsAsFactors=FALSE
#      default is TRUE
df.train<-read.csv("training.csv", stringsAsFactors=FALSE, header=TRUE)
> str(df.train)
'data.frame':	7049 obs. of  31 variables:
 $ left_eye_center_x        : num  66 64.3 65.1 65.2 66.7 ...
 $ left_eye_center_y        : num  39 35 34.9 37.3 39.6 ...
 $ right_eye_center_x       : num  30.2 29.9 30.9 32 32.2 ...
 $ right_eye_center_y       : num  36.4 33.4 34.9 37.3 38 ...
 $ left_eye_inner_corner_x  : num  59.6 58.9 59.4 60 58.6 ...
 $ left_eye_inner_corner_y  : num  39.6 35.3 36.3 39.1 39.6 ...
 $ left_eye_outer_corner_x  : num  73.1 70.7 71 72.3 72.5 ...
 $ left_eye_outer_corner_y  : num  40 36.2 36.3 38.4 39.9 ...
 $ right_eye_inner_corner_x : num  36.4 36 37.7 37.6 37 ...
 $ right_eye_inner_corner_y : num  37.4 34.4 36.3 38.8 39.1 ...
 $ right_eye_outer_corner_x : num  23.5 24.5 25 25.3 22.5 ...
 $ right_eye_outer_corner_y : num  37.4 33.1 36.6 38 38.3 ...
 $ left_eyebrow_inner_end_x : num  57 54 55.7 56.4 57.2 ...
 $ left_eyebrow_inner_end_y : num  29 28.3 27.6 30.9 30.7 ...
 $ left_eyebrow_outer_end_x : num  80.2 78.6 78.9 77.9 77.8 ...
 $ left_eyebrow_outer_end_y : num  32.2 30.4 32.7 31.7 31.7 ...
 $ right_eyebrow_inner_end_x: num  40.2 42.7 42.2 41.7 38 ...
 $ right_eyebrow_inner_end_y: num  29 26.1 28.1 31 30.9 ...
 $ right_eyebrow_outer_end_x: num  16.4 16.9 16.8 20.5 15.9 ...
 $ right_eyebrow_outer_end_y: num  29.6 27.1 32.1 29.9 30.7 ...
 $ nose_tip_x               : num  44.4 48.2 47.6 51.9 43.3 ...
 $ nose_tip_y               : num  57.1 55.7 53.5 54.2 64.9 ...
 $ mouth_left_corner_x      : num  61.2 56.4 60.8 65.6 60.7 ...
 $ mouth_left_corner_y      : num  80 76.4 73 72.7 77.5 ...
 $ mouth_right_corner_x     : num  28.6 35.1 33.7 37.2 31.2 ...
 $ mouth_right_corner_y     : num  77.4 76 72.7 74.2 77 ...
 $ mouth_center_top_lip_x   : num  43.3 46.7 47.3 50.3 45 ...
 $ mouth_center_top_lip_y   : num  72.9 70.3 70.2 70.1 73.7 ...
 $ mouth_center_bottom_lip_x: num  43.1 45.5 47.3 51.6 44.2 ...
 $ mouth_center_bottom_lip_y: num  84.5 85.5 78.7 78.3 86.9 ...
 $ Image                    : chr  "238 236 237 238 240 240 239 241 241 243 240 239 231 212 190 173 148 122 104 92 79 73 74 73 73 74 81 74 60 64 75 86 93 102 100 1"| __truncated__ "219 215 204 196 204 211 212 200 180 168 178 196 194 196 203 209 199 192 197 201 207 215 199 190 182 180 183 190 190 176 175 175"| __truncated__ "144 142 159 180 188 188 184 180 167 132 84 59 54 57 62 61 55 54 56 50 60 78 85 86 88 89 90 90 88 89 91 94 95 98 99 101 104 107 "| __truncated__ "193 192 193 194 194 194 193 192 168 111 50 12 1 1 1 1 1 1 1 1 1 1 6 16 19 17 13 13 16 22 25 31 34 27 15 19 16 19 17 13 9 6 3 1 "| __truncated__ ...

It so happens that in some images some location-points are missing. A summary of the data brings this out; it shows the number of missing values (NA's) for every location column.

> summary(df.train)
 left_eye_center_x left_eye_center_y right_eye_center_x right_eye_center_y
 Min.   :22.76     Min.   : 1.617    Min.   : 0.6866    Min.   : 4.091    
 1st Qu.:65.08     1st Qu.:35.900    1st Qu.:28.7833    1st Qu.:36.328    
 Median :66.50     Median :37.528    Median :30.2514    Median :37.813    
 Mean   :66.36     Mean   :37.651    Mean   :30.3061    Mean   :37.977    
 3rd Qu.:68.02     3rd Qu.:39.258    3rd Qu.:31.7683    3rd Qu.:39.567    
 Max.   :94.69     Max.   :80.503    Max.   :85.0394    Max.   :81.271    
 NA's   :10        NA's   :10        NA's   :13         NA's   :13  
      
 left_eye_inner_corner_x left_eye_inner_corner_y left_eye_outer_corner_x
 Min.   :19.07           Min.   :27.19           Min.   :27.57          
 1st Qu.:58.04           1st Qu.:36.63           1st Qu.:71.72          
 Median :59.30           Median :37.88           Median :73.25          
 Mean   :59.16           Mean   :37.95           Mean   :73.33          
 3rd Qu.:60.52           3rd Qu.:39.26           3rd Qu.:75.02          
 Max.   :84.44           Max.   :66.56           Max.   :95.26          
 NA's   :4778            NA's   :4778            NA's   :4782 
          
 left_eye_outer_corner_y right_eye_inner_corner_x right_eye_inner_corner_y
 Min.   :26.25           Min.   : 5.751           Min.   :26.25           
 1st Qu.:36.09           1st Qu.:35.506           1st Qu.:36.77           
 Median :37.64           Median :36.652           Median :37.94           
 Mean   :37.71           Mean   :36.653           Mean   :37.99           
 3rd Qu.:39.37           3rd Qu.:37.754           3rd Qu.:39.19           
 Max.   :64.62           Max.   :70.715           Max.   :69.81           
 NA's   :4782            NA's   :4781             NA's   :4781       
     
 right_eye_outer_corner_x right_eye_outer_corner_y left_eyebrow_inner_end_x
 Min.   : 3.98            Min.   :25.12            Min.   :17.89           
 1st Qu.:20.59            1st Qu.:36.53            1st Qu.:54.52           
 Median :22.54            Median :37.87            Median :56.24           
 Mean   :22.39            Mean   :38.03            Mean   :56.07           
 3rd Qu.:24.24            3rd Qu.:39.41            3rd Qu.:57.95           
 Max.   :61.43            Max.   :70.75            Max.   :79.79           
 NA's   :4781             NA's   :4781             NA's   :4779         
   
 left_eyebrow_inner_end_y left_eyebrow_outer_end_x left_eyebrow_outer_end_y
 Min.   :15.86            Min.   :32.21            Min.   :10.52           
 1st Qu.:27.62            1st Qu.:77.67            1st Qu.:27.67           
 Median :29.53            Median :79.78            Median :29.77           
 Mean   :29.33            Mean   :79.48            Mean   :29.73           
 3rd Qu.:31.16            3rd Qu.:81.59            3rd Qu.:31.84           
 Max.   :60.88            Max.   :94.27            Max.   :60.50           
 NA's   :4779             NA's   :4824             NA's   :4824          
  
 right_eyebrow_inner_end_x right_eyebrow_inner_end_y right_eyebrow_outer_end_x
 Min.   : 6.921            Min.   :16.48             Min.   : 3.826           
 1st Qu.:37.552            1st Qu.:27.79             1st Qu.:13.562           
 Median :39.299            Median :29.57             Median :15.786           
 Mean   :39.322            Mean   :29.50             Mean   :15.871           
 3rd Qu.:40.917            3rd Qu.:31.25             3rd Qu.:17.999           
 Max.   :76.582            Max.   :62.08             Max.   :58.418           
 NA's   :4779              NA's   :4779              NA's   :4813     
        
 right_eyebrow_outer_end_y   nose_tip_x      nose_tip_y    mouth_left_corner_x
 Min.   :13.22             Min.   :12.94   Min.   :17.93   Min.   :22.92      
 1st Qu.:28.21             1st Qu.:46.60   1st Qu.:59.29   1st Qu.:61.26      
 Median :30.32             Median :48.42   Median :63.45   Median :63.18      
 Mean   :30.43             Mean   :48.37   Mean   :62.72   Mean   :63.29      
 3rd Qu.:32.66             3rd Qu.:50.33   3rd Qu.:66.49   3rd Qu.:65.38      
 Max.   :66.75             Max.   :89.44   Max.   :95.94   Max.   :84.77      
 NA's   :4813                                              NA's   :4780     
  
 mouth_left_corner_y mouth_right_corner_x mouth_right_corner_y
 Min.   :57.02       Min.   : 2.246       Min.   :56.69       
 1st Qu.:72.88       1st Qu.:30.798       1st Qu.:73.26       
 Median :75.78       Median :32.982       Median :76.00       
 Mean   :75.97       Mean   :32.900       Mean   :76.18       
 3rd Qu.:78.88       3rd Qu.:35.101       3rd Qu.:78.96       
 Max.   :94.67       Max.   :74.018       Max.   :95.51       
 NA's   :4780        NA's   :4779         NA's   :4779        

 mouth_center_top_lip_x mouth_center_top_lip_y mouth_center_bottom_lip_x
 Min.   :12.61          Min.   :56.72          Min.   :12.54            
 1st Qu.:46.49          1st Qu.:69.40          1st Qu.:46.57            
 Median :47.91          Median :72.61          Median :48.59            
 Mean   :47.98          Mean   :72.92          Mean   :48.57            
 3rd Qu.:49.30          3rd Qu.:76.22          3rd Qu.:50.68            
 Max.   :83.99          Max.   :94.55          Max.   :89.44            
 NA's   :4774           NA's   :4774           NA's   :33            
   
 mouth_center_bottom_lip_y    Image          
 Min.   :25.85             Length:7049       
 1st Qu.:75.55             Class :character  
 Median :78.70             Mode  :character  
 Mean   :78.97                               
 3rd Qu.:82.23                               
 Max.   :95.81                               
 NA's   :33                                  

For the following six location-points, the number of NA's is at most 33:

1.left_eye_center_x  2.left_eye_center_y  3.right_eye_center_x  4.right_eye_center_y
5.mouth_center_bottom_lip_x   6.mouth_center_bottom_lip_y 

For the following two location points there are no missing values:

1.nose_tip_x      2.nose_tip_y 

For all the other 22 locations, the number of missing values is between 4774 and 4824. The list is below:

1.left_eye_inner_corner_x    2.left_eye_inner_corner_y    3.left_eye_outer_corner_x
4.left_eye_outer_corner_y    5.right_eye_inner_corner_x   6.right_eye_inner_corner_y
7.right_eye_outer_corner_x   8.right_eye_outer_corner_y   9.left_eyebrow_inner_end_x
10.left_eyebrow_inner_end_y  11.left_eyebrow_outer_end_x  12.left_eyebrow_outer_end_y
13.right_eyebrow_inner_end_x 14.right_eyebrow_inner_end_y 15.right_eyebrow_outer_end_x
16.right_eyebrow_outer_end_y 17.mouth_left_corner_x       18.mouth_left_corner_y 
19.mouth_right_corner_x      20.mouth_right_corner_y      21.mouth_center_top_lip_x 
22.mouth_center_top_lip_y

It is instructive to calculate the standard deviation of this data, for it gives an idea of how easy (or tough) our prediction task will be. A large variation implies a tougher prediction job.

> library(plyr)
  # Omit NA values and calculate sddev, column wise for 30 columns
> colwise(sd)(na.omit(df.train[,1:30]))
  left_eye_center_x  left_eye_center_y  right_eye_center_x  right_eye_center_y
1          2.087683          2.294027           2.051575           2.234334

  left_eye_inner_corner_x   left_eye_inner_corner_y      left_eye_outer_corner_x
1                2.005631                  2.0345                  2.701639

  left_eye_outer_corner_y   right_eye_inner_corner_x     right_eye_inner_corner_y
1                2.684162                 1.822784                 2.009505

  right_eye_outer_corner_x  right_eye_outer_corner_y     left_eyebrow_inner_end_x
1                 2.768804                 2.654903                2.819914

  left_eyebrow_inner_end_y  left_eyebrow_outer_end_x     left_eyebrow_outer_end_y
1                 2.867131                 3.312647                3.627187

  right_eyebrow_inner_end_x  right_eyebrow_inner_end_y   right_eyebrow_outer_end_x
1                 2.609648                  2.842219               3.337901

  right_eyebrow_outer_end_y  nose_tip_x     nose_tip_y   mouth_left_corner_x
1                 3.644342   3.276053      4.528635                3.650131

  mouth_left_corner_y        mouth_right_corner_x        mouth_right_corner_y
1            4.438565                 3.595103                      4.259514

  mouth_center_top_lip_x     mouth_center_top_lip_y      mouth_center_bottom_lip_x
1               2.723274               5.108675                     3.032389

  mouth_center_bottom_lip_y
1               4.813557

Thus, while in some cases the standard deviation is small, in other cases it is quite large. In the cases where variation is small (around 1.8, for example), even the mean of the column may serve as a reasonable prediction.

In modelling, the larger the number of observations, the better the model. Further, since we take only one location point (response variable) at a time for prediction, the size of the training data will vary from location point to location point, depending upon how many missing values that column has. For example, for predicting nose_tip_ we can use all 7049 observations, but for mouth_center_top_lip_ we will have just 7049-4774 = 2275 observations.

I, however, took a short-cut and for training considered only those rows where complete data was available. I would not advise others to do the same, though. The following R code filters the complete cases to a file on the hard disk. The code is well commented.

# R code to generate complete cases from training.csv

# Library for parallel operations
library(doMC)
registerDoMC(cores=4)
# your working directory
setwd("/home/ashokharnal/facial_key_points/")

# Read training file. Note that all fields are comma separated
#   but each Image has 96*96 = 9216 values, space separated
#    to avoid Image being treated as factor, set stringsAsFactors=FALSE
#      (default is TRUE)
df.train<-read.csv("training.csv", stringsAsFactors=FALSE, header=TRUE)

# Omit rows with incomplete classification data
ok<-complete.cases(df.train)
filtered.df.train<-df.train[ok,]
# So how many rows are left?
dim(filtered.df.train)

# Get all image data (field) into another variable
im.train<-filtered.df.train$Image
# Remove image data from filtered data
filtered.df.train$Image<-NULL
# But, introduce an ID column for later merger
filtered.df.train$id<-1:dim(filtered.df.train)[1]

# Split image data on space.
#   Split row by row (foreach)
#    and combine (.combine) each rows data by rbind
#     Do it in parallel (%dopar%)
im.train <- foreach(im = im.train, .combine=rbind) %dopar% {
    as.integer(unlist(strsplit(im, " ")))
}

# Check resulting data structure and its class
str(im.train)
class(im.train)
# Convert it to a data frame
df.im.train<-data.frame(im.train)
# Remove row names
row.names(df.im.train)<-NULL
dim(df.im.train)

# Add an ID to this image data
df.im.train$id<-1:dim(df.im.train)[1]
# Just check what default column names this data has
colnames(df.im.train)

# Merge now complete cases filtered data (30 columns) 
#   with corresponding image data 
#     Merger is on ID column
#       Then remove ID column and save the data frame to hard disk
df <- data.frame(merge(filtered.df.train,df.im.train,by="id"))
df$id<-NULL
dim(df)
write.csv(df,"complete.csv",row.names=F,quote=F)

# Recheck names of first 30 col names
colnames(df[,1:30])
# Check names of next 30 col names
colnames(df[,31:61])

The ‘test.csv‘ file has just two columns: one is the ImageId column and the other the ‘Image‘ column. Pixel values in the Image field are separated by spaces, just as in the training file. We need to drop the ImageId column from ‘test.csv‘, introduce commas between the pixel intensity values and save the resulting data frame to disk. The following R code does this; being much the same as above, it is not commented. Note that in ‘test.csv‘ the facial key-point columns are not present; predicting them is the job.

# R code to convert test data Image field to csv format
library(doMC)
registerDoMC(cores=4)
setwd("/home/ashokharnal/Documents/facial_key_points/")
df.test<-read.csv("test.csv", stringsAsFactors=FALSE, header=TRUE)
im.test<-df.test$Image
im.test <- foreach(im = im.test, .combine=rbind) %dopar% {
    as.integer(unlist(strsplit(im, " ")))
}
df<-data.frame(im.test)
row.names(df)<-NULL
write.csv(df,"t.csv",row.names=F,quote=F)

Building Model and making predictions

Once we have the data files ready, all we have to do is use R to model the data with H2O's deep learning algorithm. The code for this is given below, and explanations of the model-building parameters follow it. I have commented the code for easy understanding.

library(h2o)

# Start h2o from within R
#  Also decide min and max memory sizes
#   as also the number of cpu cores to be used (-1 means all cores)
localH2O <- h2o.init(ip = "localhost", port = 54321, startH2O = TRUE, max_mem_size = '12g', min_mem_size = '4g', nthreads = -1)

setwd("/home/ganesh/Documents/")

# Read train and test files
complete<-read.csv("complete.csv",header=T)
test<-read.csv("t.csv",header=T)

# Process begins now
Sys.time()
# Convert test data frame to h2o format
test.hex<-as.h2o(localH2O, test, key = 'test.hex')

# Initialise data frame with as many rows as are in 'test' to store results
#  Predicted response columns will be appended to this data frame
result<-data.frame(1:dim(test)[1])

# Make predictions for columns from 'start' to 'end' one by one
# (Total columns 30)
start<-1	# Start from attribute 1
end<-5          # End at attribute 5

for ( i in start:end )
	{
	# Out of 30 columns, remove from 'complete' dataFrame
        #   all columns but the response column
	#     ie column to make predictions for will stay
        #       along with columns of pixel intensity values
	col<-1:30
	col<-col[-i]
	# Filter columns from training set accordingly
	part<-complete[,-col]

	# Convert the training data frame to h2o format
	print("Convert part of csv data to h2o format")
	part.hex<-as.h2o(localH2O, part, key = 'part.hex')
	# Print the column number immediately (flush.console)
	print(i)
	flush.console()

	# Start modeling process
	c_name<-paste("Modeling for ",names(part)[1],sep="")
	# Print column name being modeled
	print(c_name)
	flush.console()

	# epoch is a learning cycle or one pass.
	# Training your network on each obs of the set once is an epoch. 

	model <- h2o.deeplearning(x = 2:9217, y = 1,  data = part.hex, nfolds = 10, l1=1e-5 ,  activation = "RectifierWithDropout", input_dropout_ratio = 0.2, hidden_dropout_ratios = c(0.5,0.5,0.5,0.5,0.5,0.5), hidden = c(200,200,100,100,50,50),  classification=FALSE, epochs = 40)

	print("Modeling completed")
	flush.console()

	## Predictions
	# In test data frame, make predictions for this column
	test_predict.hex <- h2o.predict(model, test.hex)
        # Transform it to dataframe format
	test_predict <- as.data.frame(test_predict.hex)
	# Change column name of test_predict to that of response column
	colnames(test_predict)=names(part)[1]

	# Append predicted response column to result dataframe
	result[i-start+1]<-test_predict

	# Write every result to file (sample file name is: first5.csv)
	result_file<-paste("first",end,".csv",sep="")
	write.csv(result, file = result_file , row.names=FALSE, quote=FALSE)

	# Remove garbage & release memory to OS. 
	gc()
	}

# Analysis Ending time
Sys.time()

# Before you exit R, shutdown h2o
h2o.shutdown(localH2O, prompt=FALSE)

A good example on deep learning from the h2o documentation, which explains the parameters in detail, is here. The parameters that a deep-learning model may use are given below:


h2o.deeplearning(x, y, data, key = "",override_with_best_model, classification = TRUE,
nfolds = 0, validation, holdout_fraction = 0, checkpoint = "", autoencoder,
use_all_factor_levels, activation, hidden, epochs, train_samples_per_iteration,
seed, adaptive_rate, rho, epsilon, rate, rate_annealing, rate_decay,
momentum_start, momentum_ramp, momentum_stable, nesterov_accelerated_gradient,
input_dropout_ratio, hidden_dropout_ratios, l1, l2, max_w2,
initial_weight_distribution, initial_weight_scale, loss,
score_interval, score_training_samples, score_validation_samples,
score_duty_cycle, classification_stop, regression_stop, quiet_mode,
max_confusion_matrix_size, max_hit_ratio_k, balance_classes, class_sampling_factors,
max_after_balance_size, score_validation_sampling, diagnostics,
variable_importances, fast_mode, ignore_const_cols, force_load_balance,
replicate_training_data, single_node_mode, shuffle_training_data,
sparse, col_major, max_categorical_features, reproducible)

We are using 6 hidden layers with 200, 200, 100, 100, 50 and 50 neurons. When the response variable is continuous, classification is FALSE; it is a regression. In n-fold cross-validation the data is partitioned into n parts; one part is held out for validation and the model is built on the other (n-1) parts, each part being held out in turn, and the n results are averaged to produce an accuracy estimate. nfolds in our case is 10. H2O deep learning offers a number of choices of activation function; among them are sigmoid, tanh and rectifier. Rectifier is quite accurate and is faster, while hyperbolic tangent is computationally expensive. Dropout is used to switch off certain neurons so as to avoid overfitting: input_dropout_ratio drops neurons from the input layer and hidden_dropout_ratios drops neurons from the hidden layers. When dropout ratios are specified, the activation function is ‘RectifierWithDropout’. The l1 parameter also guards against overfitting; it drives weak weights towards zero so that only strong weights remain. You can download the h2o package documentation from here.

After you have run the above R code, exit R and restart it. Change the ‘start‘ and ‘end‘ values to 6 and 10 and run the R code once again. Why this, and why not go through 1 to 10 or 1 to 30 in one go? It is because I observed that memory is not released back to the operating system after every for loop; after 6 loops or so, the process becomes very slow. I therefore have to exit R, start it again and resume the for loop from where it terminated last time. I run the R code six times (1-5, 6-10, 11-15, 16-20, 21-25 and 26-30), which results in the predicted values for all 30 columns being written to six files: first5.csv, first10.csv, first15.csv, first20.csv, first25.csv and first30.csv; each contains five columns of predicted data for the relevant columns.

Once we have all the predicted columns, the following R code compiles and prepares the data for submission to Kaggle.

library(reshape2)

setwd("/home/ganesh/Documents/")
# Read the truncated training file that we wrote above
#  Read it just to know column names of facial key-points
complete<-read.csv("complete.csv",header=T)
# Read modified test file. 
#  I want to re-verify number of rows/images
test<-read.csv("t.csv",header=T)

# Read prediction files one by one
first5<-read.csv("first5.csv",header=T)        # Cols 1 to 5
first10<-read.csv("first10.csv",header=T)      # Cols 6 to 10
first15<-read.csv("first15.csv",header=T)      # Cols 11 to 15
first20<-read.csv("first20.csv",header=T)      # Cols 16 to 20
first25<-read.csv("first25.csv",header=T)      # Cols 21 to 25
first30<-read.csv("first30.csv",header=T)      # Cols 26 to 30

# Just check if all of them have same number of rows
dim(first5)
dim(first10)
dim(first15)
dim(first20)
dim(first25)
dim(first30)

# Start merging all in 'first' data frame
first<-first5
first[,6:10]<-first10
first[,11:15]<-first15
first[,16:20]<-first20
first[,21:25]<-first25
first[,26:30]<-first30

# Assign col names to 'first' for 30 columns
colnames(first)<-names(complete[,1:30])

# Give a unique name to file that will save to disk all prediction results
#   (as you may be repeating this expt many times)
#     add date+time to its name
dt<-Sys.time()
datetime<-format(dt, format="%d%m%Y%H%M")
result_filename<-paste("first",datetime,".csv",sep="")
# This file is not to be submitted but contains column wise data
write.csv(first, file = result_filename , row.names=FALSE, quote=FALSE)

# Create a data frame with as many rows as in test.
#   ImageId column contains seq number 
predictions <- data.frame(ImageId = 1:nrow(test))
predictions[2:31]<-first         # Add other 30 columns to it
head(predictions)                # Check

# Restack predictions, ImageId wise
submission <- melt(predictions, id.vars="ImageId", variable.name="FeatureName", value.name="Location")
head(submission)
# Read IdLookupTable.csv file downloaded from Kaggle 
Id.lookup <- read.csv("IdLookupTable.csv",header=T)
Idlookup_colnames <- names(Id.lookup)
Idlookup_colnames
Id.lookup$Location <- NULL

# Row wise merger. A row in 'Id.lookup' is merged with same row in 'submission'.
#   At least one column name should be same.
# When all.x=TRUE, an extra row will be added to the output for each case in Id.lookup
#   that has no matching cases in submission.
#  Cases that do not have values from submission will be labeled as missing.
#    See: https://kb.iu.edu/d/azux    http://www.cookbook-r.com/Manipulating_data/Merging_data_frames/
msub <- merge(Id.lookup, submission, all.x=T, sort=F)
# Adds columns (RowId) not in msub
nsub <- msub[, Idlookup_colnames]

# Give a unique name to submission file
#  Add date+time to its name
submit_file<-paste("submit",datetime,".csv",sep="")
# Write to disk file for submission to Kaggle
write.csv(nsub[,c(1,4)], file=submit_file, quote=F, row.names=F)

This finishes our experiment. As mentioned above, if you avoid the short-cut of training every column on only the rows with complete data (as in ‘complete.csv‘), you can achieve much better accuracy while still working from an ordinary machine. 8GB of RAM should give you a very respectable score; on an 8GB machine, reduce the number of hidden layers and perhaps use a for-loop sequence of 1 to 3 columns per model-building run. My Kaggle score page-image is below. Good luck! (Edited: Avoiding Shortcut: some more additions have been made since I wrote the above. Please read on below!)

kaggle score

**EDITED** Avoiding Shortcut

If you want to avoid the short-cut and build the model taking into account all available data, column by column, the following two R scripts will do it for you. The first one creates a file ‘complete.csv’ that converts the image data to csv format and concatenates it with the 30 key-point columns.

# R code to transform training.csv file
#  Spaces within pixel intensity values are replaced by commas and image data is transformed
#    into data frame

library(doMC)
registerDoMC(cores=4)

setwd("/home/ganesh/Documents/data_analysis/facial_key_points/")

# Read training file. To avoid Image being treated as factor,
#   set stringsAsFactors=FALSE ( default is TRUE)
df.train<-read.csv("training.csv", stringsAsFactors=FALSE, header=TRUE)

# So how many rows are there?
dim(df.train)

# Get all image data into another variable
im.train<-df.train$Image
df.train$Image<-NULL

# Introduce an ID column in data
df.train$id<-1:dim(df.train)[1]

# Split image data on space and insert comma instead
im.train <- foreach(im = im.train, .combine=rbind) %dopar% {
    as.integer(unlist(strsplit(im, " ")))
}

# Convert it to a data frame
df.im.train<-data.frame(im.train)
# Remove row names
row.names(df.im.train)<-NULL

# Add an ID to this image data
df.im.train$id<-1:dim(df.im.train)[1]
# Just check what default column names data has
colnames(df.im.train)


# Merge now train data (30 columns) with corresponding image data 
#  Merger is on ID column
#   Then remove ID column and save the data frame to hard disk
df <- data.frame(merge(df.train,df.im.train,by="id"))
df$id<-NULL
dim(df)
write.csv(df,"complete.csv",row.names=F,quote=F)

# Recheck the names of the first 30 columns (the keypoint columns)
colnames(df[,1:30])
# Check the names of some of the following columns (the pixel columns)
colnames(df[,31:61])
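
Before moving on, it is worth checking how many usable (non-NA) training rows each of the 30 keypoint columns actually has in ‘complete.csv’; these per-column counts are exactly what the column-by-column modelling below exploits. A minimal sketch, assuming ‘complete.csv’ was written as above with the 30 keypoint columns first:

# Count the non-missing training rows available for each keypoint column
complete <- read.csv("complete.csv", header = TRUE)
available <- sapply(1:30, function(i) sum(!is.na(complete[, i])))
names(available) <- names(complete)[1:30]
available        # each model below is trained on this many rows for its column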

Once you have the file ‘complete.csv’ on the hard disk, the following R code uses h2o to train a model for each of the 30 keypoint columns, using whatever number of training rows is actually available (non-missing) for that particular column. Thus all data is used for training. But if your machine does not have sufficient RAM, the model building may be time consuming and test your patience.


# R file to create deep learning model 
# Takes into account NAs per column of data rather
#  than for complete data set.
#    This model is an improvement over the earlier one but consumes a lot of time
#      complete.csv is full data set including comma separated image pixel values

library(h2o)
localH2O <- h2o.init(ip = "localhost", port = 54321, startH2O = TRUE, max_mem_size = '12g', min_mem_size = '4g', nthreads = -1)

# Set working directory
setwd("/home/ganesh/Documents/")

complete<-read.csv("complete.csv",header=T)
test<-read.csv("t.csv",header=T)

# Analysis begin time
Sys.time()
# Convert test.csv to h2o format
test.hex<-as.h2o(localH2O, test, key = 'test.hex')
result<-data.frame(1:dim(test)[1])
# Build models for five columns at a time
start<-1           # We begin from column 1	
end<-5             # Last one is column 5

for ( i in start:end )
	{
	col<-1:30
	col<-col[-i]
	part<-complete[,-col]
	ok<-complete.cases(part)
	part<-part[ok,]
	print(paste("Records in data set are: ",nrow(part)))
	flush.console()	
	print("Convert part of csv data to h2o format")
	part.hex<-as.h2o(localH2O, part, key = 'part.hex')
	print(i)
	flush.console()
	c_name<-paste("Modeling for ",names(part)[1],sep="")
	# Print column name being modeled
	print(c_name)
	flush.console()

	model <- h2o.deeplearning(x = 2:9217, y = 1, data = part.hex,
		nfolds = 10, l1 = 1e-5,
		activation = "RectifierWithDropout",
		input_dropout_ratio = 0.2,
		hidden_dropout_ratios = c(0.5,0.5,0.5,0.5,0.5,0.5),
		hidden = c(200,200,100,100,50,50),
		classification = FALSE, epochs = 30)

	print("Modeling completed")
	flush.console()

	## Predictions
	# Make predictions for this column
	test_predict.hex <- h2o.predict(model, test.hex)
	test_predict <- as.data.frame(test_predict.hex)
	# Change column name of test_predict to that of response column
	colnames(test_predict)=names(part)[1]

	# Append predicted response column to result dataframe
	result[i-start+1]<-test_predict

	# Write every result to file
	result_file<-paste("first",end,".csv",sep="")
	write.csv(result, file = result_file , row.names=FALSE, quote=FALSE)

	# Remove garbage & release memory to OS. 
	gc()
	}

# Analysis Ending time
Sys.time()

# Before you exit R, shutdown h2o
h2o.shutdown(localH2O, prompt=FALSE)
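
The loop above covers only columns 1 to 5 (start to end); the remaining keypoint columns can be modelled by re-running the loop with a new start and end (6 to 10, 11 to 15, and so on). Below is a minimal sketch of stitching the per-batch prediction files back into the single 30-column frame ‘first’ used for the submission earlier; the file names first5.csv … first30.csv are an assumption that holds only if the batches were run exactly in steps of five.

# Sketch: combine six batch files (assumed names) into one 30-column prediction frame
batch_files <- paste0("first", seq(5, 30, by = 5), ".csv")
first <- do.call(cbind, lapply(batch_files, read.csv, header = TRUE))
dim(first)       # expected: nrow(test) rows x 30 predicted keypoint columns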

Handwritten digits recognition using one-against-all classification (oaa) in Vowpal Wabbit

March 6, 2015

Kaggle recently hosted a machine learning competition to recognize handwritten digits from 0 to 9. The handwritten digits have been taken from the MNIST database (Modified National Institute of Standards and Technology). We decided to use Vowpal Wabbit to learn the patterns of the handwritten digits in the training file and to apply that learning to the ‘test’ dataset, predicting which digit each of its images represents. Our score on Kaggle was 0.97943, i.e. 97.94% accurate prediction.

Vowpal Wabbit is very easy to install. Its installation on CentOS may not take more than 20 minutes. See the instructions here.

The dataset is in two files: train.csv and test.csv. The file train.csv contains 42,000 images, each of a single handwritten digit from 0-9. Each image is 28 X 28 pixels, that is, 784 pixels in all, lined up in one long row. The line counts of the two files and the first five lines of train.csv appear as follows:


# No of lines in train.csv
$ wc -l train.csv
42001 train.csv

# No of lines in test.csv
bash-4.1$ wc -l test.csv
28001 test.csv

# Show first five lines of train.csv
$ head --lines 5 train.csv
label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,pixel11,pixel12,pixel13,pixel14,pixel15,pixel16,pixel17,pixel18,pixel19,pixel20,pixel21,pixel22,pixel23,pixel24,pixel25,pixel26,pixel27,pixel28,pixel29,pixel30,pixel31,pixel32,pixel33,pixel34,pixel35,pixel36,pixel37,pixel38,pixel39,pixel40,pixel41,pixel42,pixel43,pixel44,pixel45,pixel46,pixel47,pixel48,pixel49,pixel50,pixel51,pixel52,pixel53,pixel54,pixel55,pixel56,pixel57,pixel58,pixel59,pixel60,pixel61,pixel62,pixel63,pixel64,pixel65,pixel66,pixel67,pixel68,pixel69,pixel70,pixel71,pixel72,pixel73,pixel74,pixel75,pixel76,pixel77,pixel78,pixel79,pixel80,pixel81,pixel82,pixel83,pixel84,pixel85,pixel86,pixel87,pixel88,pixel89,pixel90,pixel91,pixel92,pixel93,pixel94,pixel95,pixel96,pixel97,pixel98,pixel99,pixel100,pixel101,pixel102,pixel103,pixel104,pixel105,pixel106,pixel107,pixel108,pixel109,pixel110,pixel111,pixel112,pixel113,pixel114,pixel115,pixel116,pixel117,pixel118,pixel119,pixel120,pixel121,pixel122,pixel123,pixel124,pixel125,pixel126,pixel127,pixel128,pixel129,pixel130,pixel131,pixel132,pixel133,pixel134,pixel135,pixel136,pixel137,pixel138,pixel139,pixel140,pixel141,pixel142,pixel143,pixel144,pixel145,pixel146,pixel147,pixel148,pixel149,pixel150,pixel151,pixel152,pixel153,pixel154,pixel155,pixel156,pixel157,pixel158,pixel159,pixel160,pixel161,pixel162,pixel163,pixel164,pixel165,pixel166,pixel167,pixel168,pixel169,pixel170,pixel171,pixel172,pixel173,pixel174,pixel175,pixel176,pixel177,pixel178,pixel179,pixel180,pixel181,pixel182,pixel183,pixel184,pixel185,pixel186,pixel187,pixel188,pixel189,pixel190,pixel191,pixel192,pixel193,pixel194,pixel195,pixel196,pixel197,pixel198,pixel199,pixel200,pixel201,pixel202,pixel203,pixel204,pixel205,pixel206,pixel207,pixel208,pixel209,pixel210,pixel211,pixel212,pixel213,pixel214,pixel215,pixel216,pixel217,pixel218,pixel219,pixel220,pixel221,pixel222,pixel223,pixel224,pixel225,pixel226,pixel227,pixel228,pixel229,pixel230,pixel231,pixel232,pixel233,pixel234,pixel235,pixel236,pixel237,pixel238,pixel239,pixel240,pixel241,pixel242,pixel243,pixel244,pixel245,pixel246,pixel247,pixel248,pixel249,pixel250,pixel251,pixel252,pixel253,pixel254,pixel255,pixel256,pixel257,pixel258,pixel259,pixel260,pixel261,pixel262,pixel263,pixel264,pixel265,pixel266,pixel267,pixel268,pixel269,pixel270,pixel271,pixel272,pixel273,pixel274,pixel275,pixel276,pixel277,pixel278,pixel279,pixel280,pixel281,pixel282,pixel283,pixel284,pixel285,pixel286,pixel287,pixel288,pixel289,pixel290,pixel291,pixel292,pixel293,pixel294,pixel295,pixel296,pixel297,pixel298,pixel299,pixel300,pixel301,pixel302,pixel303,pixel304,pixel305,pixel306,pixel307,pixel308,pixel309,pixel310,pixel311,pixel312,pixel313,pixel314,pixel315,pixel316,pixel317,pixel318,pixel319,pixel320,pixel321,pixel322,pixel323,pixel324,pixel325,pixel326,pixel327,pixel328,pixel329,pixel330,pixel331,pixel332,pixel333,pixel334,pixel335,pixel336,pixel337,pixel338,pixel339,pixel340,pixel341,pixel342,pixel343,pixel344,pixel345,pixel346,pixel347,pixel348,pixel349,pixel350,pixel351,pixel352,pixel353,pixel354,pixel355,pixel356,pixel357,pixel358,pixel359,pixel360,pixel361,pixel362,pixel363,pixel364,pixel365,pixel366,pixel367,pixel368,pixel369,pixel370,pixel371,pixel372,pixel373,pixel374,pixel375,pixel376,pixel377,pixel378,pixel379,pixel380,pixel381,pixel382,pixel383,pixel384,pixel385,pixel386,pixel387,pixel388,pixel389,pixel390,pixel391,pixel392,pixel393,pixel394,pixel395,pixel396,pixel397,pixel398,pixel399,pixel400,pixel401,pixel402,pixel403,pixel404,pixel405,pixel
406,pixel407,pixel408,pixel409,pixel410,pixel411,pixel412,pixel413,pixel414,pixel415,pixel416,pixel417,pixel418,pixel419,pixel420,pixel421,pixel422,pixel423,pixel424,pixel425,pixel426,pixel427,pixel428,pixel429,pixel430,pixel431,pixel432,pixel433,pixel434,pixel435,pixel436,pixel437,pixel438,pixel439,pixel440,pixel441,pixel442,pixel443,pixel444,pixel445,pixel446,pixel447,pixel448,pixel449,pixel450,pixel451,pixel452,pixel453,pixel454,pixel455,pixel456,pixel457,pixel458,pixel459,pixel460,pixel461,pixel462,pixel463,pixel464,pixel465,pixel466,pixel467,pixel468,pixel469,pixel470,pixel471,pixel472,pixel473,pixel474,pixel475,pixel476,pixel477,pixel478,pixel479,pixel480,pixel481,pixel482,pixel483,pixel484,pixel485,pixel486,pixel487,pixel488,pixel489,pixel490,pixel491,pixel492,pixel493,pixel494,pixel495,pixel496,pixel497,pixel498,pixel499,pixel500,pixel501,pixel502,pixel503,pixel504,pixel505,pixel506,pixel507,pixel508,pixel509,pixel510,pixel511,pixel512,pixel513,pixel514,pixel515,pixel516,pixel517,pixel518,pixel519,pixel520,pixel521,pixel522,pixel523,pixel524,pixel525,pixel526,pixel527,pixel528,pixel529,pixel530,pixel531,pixel532,pixel533,pixel534,pixel535,pixel536,pixel537,pixel538,pixel539,pixel540,pixel541,pixel542,pixel543,pixel544,pixel545,pixel546,pixel547,pixel548,pixel549,pixel550,pixel551,pixel552,pixel553,pixel554,pixel555,pixel556,pixel557,pixel558,pixel559,pixel560,pixel561,pixel562,pixel563,pixel564,pixel565,pixel566,pixel567,pixel568,pixel569,pixel570,pixel571,pixel572,pixel573,pixel574,pixel575,pixel576,pixel577,pixel578,pixel579,pixel580,pixel581,pixel582,pixel583,pixel584,pixel585,pixel586,pixel587,pixel588,pixel589,pixel590,pixel591,pixel592,pixel593,pixel594,pixel595,pixel596,pixel597,pixel598,pixel599,pixel600,pixel601,pixel602,pixel603,pixel604,pixel605,pixel606,pixel607,pixel608,pixel609,pixel610,pixel611,pixel612,pixel613,pixel614,pixel615,pixel616,pixel617,pixel618,pixel619,pixel620,pixel621,pixel622,pixel623,pixel624,pixel625,pixel626,pixel627,pixel628,pixel629,pixel630,pixel631,pixel632,pixel633,pixel634,pixel635,pixel636,pixel637,pixel638,pixel639,pixel640,pixel641,pixel642,pixel643,pixel644,pixel645,pixel646,pixel647,pixel648,pixel649,pixel650,pixel651,pixel652,pixel653,pixel654,pixel655,pixel656,pixel657,pixel658,pixel659,pixel660,pixel661,pixel662,pixel663,pixel664,pixel665,pixel666,pixel667,pixel668,pixel669,pixel670,pixel671,pixel672,pixel673,pixel674,pixel675,pixel676,pixel677,pixel678,pixel679,pixel680,pixel681,pixel682,pixel683,pixel684,pixel685,pixel686,pixel687,pixel688,pixel689,pixel690,pixel691,pixel692,pixel693,pixel694,pixel695,pixel696,pixel697,pixel698,pixel699,pixel700,pixel701,pixel702,pixel703,pixel704,pixel705,pixel706,pixel707,pixel708,pixel709,pixel710,pixel711,pixel712,pixel713,pixel714,pixel715,pixel716,pixel717,pixel718,pixel719,pixel720,pixel721,pixel722,pixel723,pixel724,pixel725,pixel726,pixel727,pixel728,pixel729,pixel730,pixel731,pixel732,pixel733,pixel734,pixel735,pixel736,pixel737,pixel738,pixel739,pixel740,pixel741,pixel742,pixel743,pixel744,pixel745,pixel746,pixel747,pixel748,pixel749,pixel750,pixel751,pixel752,pixel753,pixel754,pixel755,pixel756,pixel757,pixel758,pixel759,pixel760,pixel761,pixel762,pixel763,pixel764,pixel765,pixel766,pixel767,pixel768,pixel769,pixel770,pixel771,pixel772,pixel773,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,188,255,94,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,191,250,253,93,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,248,253,167,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,80,247,253,208,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,29,207,253,235,77,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,54,209,253,253,88,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,93,254,253,238,170,17,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,210,254,253,159,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,16,209,253,254,240,81,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,27,253,253,254,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,206,254,254,198,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,168,253,253,196,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,203,253,248,76,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,22,188,253,245,93,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,103,253,253,191,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,240,253,195,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,15,220,253,253,80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,94,253,253,253,94,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,251,253,250,131,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,214,218,95,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,18,30,137,137,192,86,72,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,86,250,254,254,254,254,217,246,151,32,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,16,179,254,254,254,254,254,254,254,254,254,231,54,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,72,254,254,254,254,254,254,254,254,254,254,254,254,104,0,0,0,0,0,0,0,0,0,0,0,0,0,61,191,254,254,254,254,254,109,83,199,254,254,254,254,243,85,0,0,0,0,0,0,0,0,0,0,0,0,172,254,254,254,202,147,147,45,0,11,29,200,254,254,254,171,0,0,0,0,0,0,0,0,0,0,0,1,174,254,254,89,67,0,0,0,0,0,0,128,252,254,254,212,76,0,0,0,0,0,0,0,0,0,0,47,254,254,254,29,0,0,0,0,0,0,0,0,83,254,254,254,153,0,0,0,0,0,0,0,0,0,0,80,254,254,240,24,0,0,0,0,0,0,0,0,25,240,254,254,153,0,0,0,0,0,0,0,0,0,0,64,254,254,186,7,0,0,0,0,0,0,0,0,0,166,254,254,224,12,0,0,0,0,0,0,0,0,14,232,254,254,254,29,0,0,0,0,0,0,0,0,0,75,254,254,254,17,0,0,0,0,0,0,0,0,18,254,254,254,254,29,0,0,0,0,0,0,0,0,0,48,254,254,254,17,0,0,0,0,0,0,0,0,2,163,254,254,254,29,0,0,0,0,0,0,0,0,0,48,254,254,254,17,0,0,0,0,0,0,0,0,0,94,254,254,254,200,12,0,0,0,0,0,0,0,16,209,254,254,150,1,0,0,0,0,0,0,0,0,0,15,206,254,254,254,202,66,0,0,0,0,0,21,161,254,254,245,31,0,0,0,0,0,0,0,0,0,0,0,60,212,254,254,254,194,48,48,34,41,48,209,254,254,254,171,0,0,0,0,0,0,0,0,0,0,0,0,0,86,243,254,254,254,254,254,233,243,254,254,254,254,254,86,0,0,0,0,0,0,0,0,0,0,0,0,0,0,114,254,254,254,254,254,254,254,254,254,254,239,86,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,182,254,254,254,254,254,254,254,254,243,70,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,76,146,254,255,254,255,146,19,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,141,139,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,106,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,185,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,146,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,156,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,185,255,255,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,185,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,185,254,254,184,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,63,254,254,62,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,220,179,6,0,0,0,0,0,0,0,0,9,77,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,28,247,17,0,0,0,0,0,0,0,0,27,202,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,242,155,0,0,0,0,0,0,0,0,27,254,63,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,160,207,6,0,0,0,0,0,0,0,27,254,65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,127,254,21,0,0,0,0,0,0,0,20,239,65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,77,254,21,0,0,0,0,0,0,0,0,195,65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,70,254,21,0,0,0,0,0,0,0,0,195,142,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,56,251,21,0,0,0,0,0,0,0,0,195,227,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,222,153,5,0,0,0,0,0,0,0,120,240,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,67,251,40,0,0,0,0,0,0,0,94,255,69,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,234,184,0,0,0,0,0,0,0,19,245,69,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,234,169,0,0,0,0,0,0,0,3,199,182,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,154,205,4,0,0,26,72,128,203,208,254,254,131,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,61,254,129,113,186,245,251,189,75,56,136,254,73,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,15,216,233,233,159,104,52,0,0,0,38,254,73,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,18,254,73,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,18,254,73,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,206,106,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,186,159,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,209,101,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

The first column is ‘label’, the digit itself, and the rest are the 784 pixel (intensity) values, each an integer between 0 and 255 (inclusive). These are gray-scale images. For any row, if you remove the label column and wrap the pixels in groups of 28 (i.e. into a 28 X 28 matrix), one group after another, you will see the picture of the digit emerge. In the following R code, we save the second row of train.csv as a 28 X 28 matrix to a file:

> # Read all data
> data<-read.csv("train.csv",header=T)
> # Its dimensions?
> dim(data)
[1] 42000   785
> # Just the second row and only pixels not label
> pixels<-data[2,2:785]
> dim(pixels)
[1]   1 784
> # What is the label
> data[2,1]
[1] 0
> # Save to a file; while saving, scale [0,255] to [0,1] and round each value to 0 or 1
> write.table(matrix(round(pixels/255),28,28), file = "second.txt", sep = " ", row.names = FALSE, col.names = FALSE)
> # Also append the label value to this file
>write.table(paste("label",data[2,1]), file = "second.txt", sep = " ", row.names = FALSE, col.names = FALSE,append=T)
>

The data saved in the file ‘second.txt’ is shown below. The pattern of ‘zero’ is obvious.

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0
0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
"label 0"

The following R code will directly plot this image:

train <- read.csv("train.csv", header=TRUE)
# Read 2nd row and ignore label column
data<-train[2,2:785]
data<-as.matrix(data,nrow=28,ncol=28)
dim(data)
data<-matrix(data,nrow=28,ncol=28)
dim(data)
##Color ramp def.
colors<-c('white','black')
# Create a function to generate a continuum of colors
#  of desired number of colours from white to black
ramp_pal<-colorRampPalette(colors=colors)
# Draw an image of data over a grid of x(1:28), y(1:28)
image(1:28,1:28,data,main="IInd row. Label=0",col=ramp_pal(256))
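
One small caveat: image() draws matrix column 1 at the bottom of the plot, while the pixel rows of the digit run from top to bottom, so the digit may come out upside-down. If that happens, reversing the matrix columns before plotting flips it the right way up:

# Optional: flip vertically so the digit is drawn the right way up
image(1:28, 1:28, data[, 28:1], main="IInd row. Label=0", col=ramp_pal(256))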

We will use Vowpal Wabbit to train our classifier. It is a multiclass problem (as against binary) in the sense that the label column has 10 classes, 0 to 9; any one of these 10 classes is possible. Vowpal Wabbit provides a number of algorithms for training on multiclass records. One among them is the One-Against-All (oaa), or one-against-rest, classifier. This technique works as follows:

Suppose that for a record label there are K classes. We then create K binary classifiers: each classifier treats one particular class (one among the K) as positive and all the others as negative. The pseudocode is as follows (please see Wikipedia):

Inputs:

  • Labels, y, where yi ∈ {1, …, K} is the label for the sample Xi
  • Samples, X
  • L, a learner (training algorithm for binary classifiers)

Output:

  • A list of classifiers fk for k ∈ {1, …, K}

Procedure:

  • For each k in {1, …, K}:
    • Step 1: Construct a new label vector z, where zi = 1 when yi = k and 0 (or −1) otherwise
    • Step 2: Apply L to (X, z) to obtain fk

Making decisions means applying all K binary classifiers to an unseen sample x and predicting the label k for which the corresponding classifier reports the highest confidence score. A weakness of the OAA strategy is that the class distribution each binary classifier sees can be very skewed: even if the original classes are balanced, each binary learner sees far more negatives than positives.
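
To make the procedure concrete, here is a minimal one-against-all sketch in plain R. It only illustrates the idea above and is not what Vowpal Wabbit does internally; oaa_train and oaa_predict are hypothetical helper names, a simple logistic regression plays the role of the binary learner L, and the built-in iris data is used just to keep the example self-contained.

# One-against-all with K logistic-regression classifiers (illustration only)
oaa_train <- function(X, y) {
  classes <- sort(unique(y))
  models <- lapply(classes, function(k) {
    z <- as.integer(y == k)                        # Step 1: 1 for class k, 0 for the rest
    glm(z ~ ., data = data.frame(z = z, X), family = binomial)   # Step 2: apply learner L
  })
  list(models = models, classes = classes)
}

oaa_predict <- function(fit, X) {
  # Score the samples with every binary classifier and pick the most confident class
  scores <- sapply(fit$models, function(m)
                   predict(m, newdata = data.frame(X), type = "response"))
  scores <- matrix(scores, ncol = length(fit$models))
  fit$classes[max.col(scores)]
}

# Tiny usage example on iris (3 classes)
idx  <- sample(nrow(iris), 120)
fit  <- oaa_train(iris[idx, 1:4], iris$Species[idx])
pred <- oaa_predict(fit, iris[-idx, 1:4])
mean(pred == iris$Species[-idx])                   # rough hold-out accuracy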

Vowpal Wabbit requires that the csv file first be converted to VW format. The input format for VW files is explained here. In our case all predictor variables (pixels) are numeric. Note that in the VW rows below the digit 0 appears as class label 10 (oaa expects labels in the range 1 to K), the pixel intensities have been scaled to [0,1] by dividing by 255, and the features have been renamed pixel1 to pixel784. Thus, the VW format in our case looks like this (a few sample rows shown):


1 |image pixel1:0 pixel2:0 pixel3:0 pixel4:0 pixel5:0 pixel6:0 pixel7:0 pixel8:0 pixel9:0 pixel10:0 pixel11:0 pixel12:0 pixel13:0 pixel14:0 pixel15:0 pixel16:0 pixel17:0 pixel18:0 pixel19:0 pixel20:0 pixel21:0 pixel22:0 pixel23:0 pixel24:0 pixel25:0 pixel26:0 pixel27:0 pixel28:0 pixel29:0 pixel30:0 pixel31:0 pixel32:0 pixel33:0 pixel34:0 pixel35:0 pixel36:0 pixel37:0 pixel38:0 pixel39:0 pixel40:0 pixel41:0 pixel42:0 pixel43:0 pixel44:0 pixel45:0 pixel46:0 pixel47:0 pixel48:0 pixel49:0 pixel50:0 pixel51:0 pixel52:0 pixel53:0 pixel54:0 pixel55:0 pixel56:0 pixel57:0 pixel58:0 pixel59:0 pixel60:0 pixel61:0 pixel62:0 pixel63:0 pixel64:0 pixel65:0 pixel66:0 pixel67:0 pixel68:0 pixel69:0 pixel70:0 pixel71:0 pixel72:0 pixel73:0 pixel74:0 pixel75:0 pixel76:0 pixel77:0 pixel78:0 pixel79:0 pixel80:0 pixel81:0 pixel82:0 pixel83:0 pixel84:0 pixel85:0 pixel86:0 pixel87:0 pixel88:0 pixel89:0 pixel90:0 pixel91:0 pixel92:0 pixel93:0 pixel94:0 pixel95:0 pixel96:0 pixel97:0 pixel98:0 pixel99:0 pixel100:0 pixel101:0 pixel102:0 pixel103:0 pixel104:0 pixel105:0 pixel106:0 pixel107:0 pixel108:0 pixel109:0 pixel110:0 pixel111:0 pixel112:0 pixel113:0 pixel114:0 pixel115:0 pixel116:0 pixel117:0 pixel118:0 pixel119:0 pixel120:0 pixel121:0 pixel122:0 pixel123:0 pixel124:0 pixel125:0 pixel126:0 pixel127:0 pixel128:0 pixel129:0 pixel130:0 pixel131:0 pixel132:0 pixel133:0.737254901960784 pixel134:1 pixel135:0.368627450980392 pixel136:0 pixel137:0 pixel138:0 pixel139:0 pixel140:0 pixel141:0 pixel142:0 pixel143:0 pixel144:0 pixel145:0 pixel146:0 pixel147:0 pixel148:0 pixel149:0 pixel150:0 pixel151:0 pixel152:0 pixel153:0 pixel154:0 pixel155:0 pixel156:0 pixel157:0 pixel158:0 pixel159:0 pixel160:0.749019607843137 pixel161:0.980392156862745 pixel162:0.992156862745098 pixel163:0.364705882352941 pixel164:0 pixel165:0 pixel166:0 pixel167:0 pixel168:0 pixel169:0 pixel170:0 pixel171:0 pixel172:0 pixel173:0 pixel174:0 pixel175:0 pixel176:0 pixel177:0 pixel178:0 pixel179:0 pixel180:0 pixel181:0 pixel182:0 pixel183:0 pixel184:0 pixel185:0 pixel186:0 pixel187:0.482352941176471 pixel188:0.972549019607843 pixel189:0.992156862745098 pixel190:0.654901960784314 pixel191:0.0392156862745098 pixel192:0 pixel193:0 pixel194:0 pixel195:0 pixel196:0 pixel197:0 pixel198:0 pixel199:0 pixel200:0 pixel201:0 pixel202:0 pixel203:0 pixel204:0 pixel205:0 pixel206:0 pixel207:0 pixel208:0 pixel209:0 pixel210:0 pixel211:0 pixel212:0 pixel213:0 pixel214:0.313725490196078 pixel215:0.968627450980392 pixel216:0.992156862745098 pixel217:0.815686274509804 pixel218:0.0509803921568627 pixel219:0 pixel220:0 pixel221:0 pixel222:0 pixel223:0 pixel224:0 pixel225:0 pixel226:0 pixel227:0 pixel228:0 pixel229:0 pixel230:0 pixel231:0 pixel232:0 pixel233:0 pixel234:0 pixel235:0 pixel236:0 pixel237:0 pixel238:0 pixel239:0 pixel240:0 pixel241:0.113725490196078 pixel242:0.811764705882353 pixel243:0.992156862745098 pixel244:0.92156862745098 pixel245:0.301960784313725 pixel246:0 pixel247:0 pixel248:0 pixel249:0 pixel250:0 pixel251:0 pixel252:0 pixel253:0 pixel254:0 pixel255:0 pixel256:0 pixel257:0 pixel258:0 pixel259:0 pixel260:0 pixel261:0 pixel262:0 pixel263:0 pixel264:0 pixel265:0 pixel266:0 pixel267:0 pixel268:0.211764705882353 pixel269:0.819607843137255 pixel270:0.992156862745098 pixel271:0.992156862745098 pixel272:0.345098039215686 pixel273:0 pixel274:0 pixel275:0 pixel276:0 pixel277:0 pixel278:0 pixel279:0 pixel280:0 pixel281:0 pixel282:0 pixel283:0 pixel284:0 pixel285:0 pixel286:0 pixel287:0 pixel288:0 pixel289:0 pixel290:0 pixel291:0 pixel292:0 pixel293:0 pixel294:0 
pixel295:0.364705882352941 pixel296:0.996078431372549 pixel297:0.992156862745098 pixel298:0.933333333333333 pixel299:0.666666666666667 pixel300:0.0666666666666667 pixel301:0 pixel302:0 pixel303:0 pixel304:0 pixel305:0 pixel306:0 pixel307:0 pixel308:0 pixel309:0 pixel310:0 pixel311:0 pixel312:0 pixel313:0 pixel314:0 pixel315:0 pixel316:0 pixel317:0 pixel318:0 pixel319:0 pixel320:0 pixel321:0 pixel322:0.0901960784313725 pixel323:0.823529411764706 pixel324:0.996078431372549 pixel325:0.992156862745098 pixel326:0.623529411764706 pixel327:0 pixel328:0 pixel329:0 pixel330:0 pixel331:0 pixel332:0 pixel333:0 pixel334:0 pixel335:0 pixel336:0 pixel337:0 pixel338:0 pixel339:0 pixel340:0 pixel341:0 pixel342:0 pixel343:0 pixel344:0 pixel345:0 pixel346:0 pixel347:0 pixel348:0 pixel349:0.0627450980392157 pixel350:0.819607843137255 pixel351:0.992156862745098 pixel352:0.996078431372549 pixel353:0.941176470588235 pixel354:0.317647058823529 pixel355:0 pixel356:0 pixel357:0 pixel358:0 pixel359:0 pixel360:0 pixel361:0 pixel362:0 pixel363:0 pixel364:0 pixel365:0 pixel366:0 pixel367:0 pixel368:0 pixel369:0 pixel370:0 pixel371:0 pixel372:0 pixel373:0 pixel374:0 pixel375:0 pixel376:0 pixel377:0.105882352941176 pixel378:0.992156862745098 pixel379:0.992156862745098 pixel380:0.996078431372549 pixel381:0.0509803921568627 pixel382:0 pixel383:0 pixel384:0 pixel385:0 pixel386:0 pixel387:0 pixel388:0 pixel389:0 pixel390:0 pixel391:0 pixel392:0 pixel393:0 pixel394:0 pixel395:0 pixel396:0 pixel397:0 pixel398:0 pixel399:0 pixel400:0 pixel401:0 pixel402:0 pixel403:0 pixel404:0.0784313725490196 pixel405:0.807843137254902 pixel406:0.996078431372549 pixel407:0.996078431372549 pixel408:0.776470588235294 pixel409:0.0274509803921569 pixel410:0 pixel411:0 pixel412:0 pixel413:0 pixel414:0 pixel415:0 pixel416:0 pixel417:0 pixel418:0 pixel419:0 pixel420:0 pixel421:0 pixel422:0 pixel423:0 pixel424:0 pixel425:0 pixel426:0 pixel427:0 pixel428:0 pixel429:0 pixel430:0 pixel431:0 pixel432:0.658823529411765 pixel433:0.992156862745098 pixel434:0.992156862745098 pixel435:0.768627450980392 pixel436:0.0274509803921569 pixel437:0 pixel438:0 pixel439:0 pixel440:0 pixel441:0 pixel442:0 pixel443:0 pixel444:0 pixel445:0 pixel446:0 pixel447:0 pixel448:0 pixel449:0 pixel450:0 pixel451:0 pixel452:0 pixel453:0 pixel454:0 pixel455:0 pixel456:0 pixel457:0 pixel458:0 pixel459:0.0784313725490196 pixel460:0.796078431372549 pixel461:0.992156862745098 pixel462:0.972549019607843 pixel463:0.298039215686275 pixel464:0 pixel465:0 pixel466:0 pixel467:0 pixel468:0 pixel469:0 pixel470:0 pixel471:0 pixel472:0 pixel473:0 pixel474:0 pixel475:0 pixel476:0 pixel477:0 pixel478:0 pixel479:0 pixel480:0 pixel481:0 pixel482:0 pixel483:0 pixel484:0 pixel485:0 pixel486:0.0862745098039216 pixel487:0.737254901960784 pixel488:0.992156862745098 pixel489:0.96078431372549 pixel490:0.364705882352941 pixel491:0 pixel492:0 pixel493:0 pixel494:0 pixel495:0 pixel496:0 pixel497:0 pixel498:0 pixel499:0 pixel500:0 pixel501:0 pixel502:0 pixel503:0 pixel504:0 pixel505:0 pixel506:0 pixel507:0 pixel508:0 pixel509:0 pixel510:0 pixel511:0 pixel512:0 pixel513:0 pixel514:0.403921568627451 pixel515:0.992156862745098 pixel516:0.992156862745098 pixel517:0.749019607843137 pixel518:0 pixel519:0 pixel520:0 pixel521:0 pixel522:0 pixel523:0 pixel524:0 pixel525:0 pixel526:0 pixel527:0 pixel528:0 pixel529:0 pixel530:0 pixel531:0 pixel532:0 pixel533:0 pixel534:0 pixel535:0 pixel536:0 pixel537:0 pixel538:0 pixel539:0 pixel540:0 pixel541:0.349019607843137 pixel542:0.941176470588235 pixel543:0.992156862745098 
pixel544:0.764705882352941 pixel545:0.0980392156862745 pixel546:0 pixel547:0 pixel548:0 pixel549:0 pixel550:0 pixel551:0 pixel552:0 pixel553:0 pixel554:0 pixel555:0 pixel556:0 pixel557:0 pixel558:0 pixel559:0 pixel560:0 pixel561:0 pixel562:0 pixel563:0 pixel564:0 pixel565:0 pixel566:0 pixel567:0 pixel568:0.0588235294117647 pixel569:0.862745098039216 pixel570:0.992156862745098 pixel571:0.992156862745098 pixel572:0.313725490196078 pixel573:0 pixel574:0 pixel575:0 pixel576:0 pixel577:0 pixel578:0 pixel579:0 pixel580:0 pixel581:0 pixel582:0 pixel583:0 pixel584:0 pixel585:0 pixel586:0 pixel587:0 pixel588:0 pixel589:0 pixel590:0 pixel591:0 pixel592:0 pixel593:0 pixel594:0 pixel595:0 pixel596:0.368627450980392 pixel597:0.992156862745098 pixel598:0.992156862745098 pixel599:0.992156862745098 pixel600:0.368627450980392 pixel601:0 pixel602:0 pixel603:0 pixel604:0 pixel605:0 pixel606:0 pixel607:0 pixel608:0 pixel609:0 pixel610:0 pixel611:0 pixel612:0 pixel613:0 pixel614:0 pixel615:0 pixel616:0 pixel617:0 pixel618:0 pixel619:0 pixel620:0 pixel621:0 pixel622:0 pixel623:0 pixel624:0.349019607843137 pixel625:0.984313725490196 pixel626:0.992156862745098 pixel627:0.980392156862745 pixel628:0.513725490196078 pixel629:0 pixel630:0 pixel631:0 pixel632:0 pixel633:0 pixel634:0 pixel635:0 pixel636:0 pixel637:0 pixel638:0 pixel639:0 pixel640:0 pixel641:0 pixel642:0 pixel643:0 pixel644:0 pixel645:0 pixel646:0 pixel647:0 pixel648:0 pixel649:0 pixel650:0 pixel651:0 pixel652:0 pixel653:0.83921568627451 pixel654:0.854901960784314 pixel655:0.372549019607843 pixel656:0 pixel657:0 pixel658:0 pixel659:0 pixel660:0 pixel661:0 pixel662:0 pixel663:0 pixel664:0 pixel665:0 pixel666:0 pixel667:0 pixel668:0 pixel669:0 pixel670:0 pixel671:0 pixel672:0 pixel673:0 pixel674:0 pixel675:0 pixel676:0 pixel677:0 pixel678:0 pixel679:0 pixel680:0 pixel681:0 pixel682:0 pixel683:0 pixel684:0 pixel685:0 pixel686:0 pixel687:0 pixel688:0 pixel689:0 pixel690:0 pixel691:0 pixel692:0 pixel693:0 pixel694:0 pixel695:0 pixel696:0 pixel697:0 pixel698:0 pixel699:0 pixel700:0 pixel701:0 pixel702:0 pixel703:0 pixel704:0 pixel705:0 pixel706:0 pixel707:0 pixel708:0 pixel709:0 pixel710:0 pixel711:0 pixel712:0 pixel713:0 pixel714:0 pixel715:0 pixel716:0 pixel717:0 pixel718:0 pixel719:0 pixel720:0 pixel721:0 pixel722:0 pixel723:0 pixel724:0 pixel725:0 pixel726:0 pixel727:0 pixel728:0 pixel729:0 pixel730:0 pixel731:0 pixel732:0 pixel733:0 pixel734:0 pixel735:0 pixel736:0 pixel737:0 pixel738:0 pixel739:0 pixel740:0 pixel741:0 pixel742:0 pixel743:0 pixel744:0 pixel745:0 pixel746:0 pixel747:0 pixel748:0 pixel749:0 pixel750:0 pixel751:0 pixel752:0 pixel753:0 pixel754:0 pixel755:0 pixel756:0 pixel757:0 pixel758:0 pixel759:0 pixel760:0 pixel761:0 pixel762:0 pixel763:0 pixel764:0 pixel765:0 pixel766:0 pixel767:0 pixel768:0 pixel769:0 pixel770:0 pixel771:0 pixel772:0 pixel773:0 pixel774:0 pixel775:0 pixel776:0 pixel777:0 pixel778:0 pixel779:0 pixel780:0 pixel781:0 pixel782:0 pixel783:0 pixel784:0
10 |image pixel1:0 pixel2:0 pixel3:0 pixel4:0 pixel5:0 pixel6:0 pixel7:0 pixel8:0 pixel9:0 pixel10:0 pixel11:0 pixel12:0 pixel13:0 pixel14:0 pixel15:0 pixel16:0 pixel17:0 pixel18:0 pixel19:0 pixel20:0 pixel21:0 pixel22:0 pixel23:0 pixel24:0 pixel25:0 pixel26:0 pixel27:0 pixel28:0 pixel29:0 pixel30:0 pixel31:0 pixel32:0 pixel33:0 pixel34:0 pixel35:0 pixel36:0 pixel37:0 pixel38:0 pixel39:0 pixel40:0 pixel41:0 pixel42:0 pixel43:0 pixel44:0 pixel45:0 pixel46:0 pixel47:0 pixel48:0 pixel49:0 pixel50:0 pixel51:0 pixel52:0 pixel53:0 pixel54:0 pixel55:0 pixel56:0 pixel57:0 pixel58:0 pixel59:0 pixel60:0 pixel61:0 pixel62:0 pixel63:0 pixel64:0 pixel65:0 pixel66:0 pixel67:0 pixel68:0 pixel69:0 pixel70:0 pixel71:0 pixel72:0 pixel73:0 pixel74:0 pixel75:0 pixel76:0 pixel77:0 pixel78:0 pixel79:0 pixel80:0 pixel81:0 pixel82:0 pixel83:0 pixel84:0 pixel85:0 pixel86:0 pixel87:0 pixel88:0 pixel89:0 pixel90:0 pixel91:0 pixel92:0 pixel93:0 pixel94:0 pixel95:0 pixel96:0 pixel97:0 pixel98:0 pixel99:0 pixel100:0 pixel101:0 pixel102:0 pixel103:0 pixel104:0 pixel105:0 pixel106:0 pixel107:0 pixel108:0 pixel109:0 pixel110:0 pixel111:0 pixel112:0 pixel113:0 pixel114:0 pixel115:0 pixel116:0 pixel117:0 pixel118:0 pixel119:0 pixel120:0 pixel121:0 pixel122:0 pixel123:0.0705882352941176 pixel124:0.117647058823529 pixel125:0.537254901960784 pixel126:0.537254901960784 pixel127:0.752941176470588 pixel128:0.337254901960784 pixel129:0.282352941176471 pixel130:0.00392156862745098 pixel131:0 pixel132:0 pixel133:0 pixel134:0 pixel135:0 pixel136:0 pixel137:0 pixel138:0 pixel139:0 pixel140:0 pixel141:0 pixel142:0 pixel143:0 pixel144:0 pixel145:0 pixel146:0 pixel147:0 pixel148:0 pixel149:0.0509803921568627 pixel150:0.337254901960784 pixel151:0.980392156862745 pixel152:0.996078431372549 pixel153:0.996078431372549 pixel154:0.996078431372549 pixel155:0.996078431372549 pixel156:0.850980392156863 pixel157:0.964705882352941 pixel158:0.592156862745098 pixel159:0.125490196078431 pixel160:0 pixel161:0 pixel162:0 pixel163:0 pixel164:0 pixel165:0 pixel166:0 pixel167:0 pixel168:0 pixel169:0 pixel170:0 pixel171:0 pixel172:0 pixel173:0 pixel174:0 pixel175:0 pixel176:0.0627450980392157 pixel177:0.701960784313725 pixel178:0.996078431372549 pixel179:0.996078431372549 pixel180:0.996078431372549 pixel181:0.996078431372549 pixel182:0.996078431372549 pixel183:0.996078431372549 pixel184:0.996078431372549 pixel185:0.996078431372549 pixel186:0.996078431372549 pixel187:0.905882352941176 pixel188:0.211764705882353 pixel189:0.0588235294117647 pixel190:0 pixel191:0 pixel192:0 pixel193:0 pixel194:0 pixel195:0 pixel196:0 pixel197:0 pixel198:0 pixel199:0 pixel200:0 pixel201:0 pixel202:0 pixel203:0 pixel204:0.282352941176471 pixel205:0.996078431372549 pixel206:0.996078431372549 pixel207:0.996078431372549 pixel208:0.996078431372549 pixel209:0.996078431372549 pixel210:0.996078431372549 pixel211:0.996078431372549 pixel212:0.996078431372549 pixel213:0.996078431372549 pixel214:0.996078431372549 pixel215:0.996078431372549 pixel216:0.996078431372549 pixel217:0.407843137254902 pixel218:0 pixel219:0 pixel220:0 pixel221:0 pixel222:0 pixel223:0 pixel224:0 pixel225:0 pixel226:0 pixel227:0 pixel228:0 pixel229:0 pixel230:0 pixel231:0.23921568627451 pixel232:0.749019607843137 pixel233:0.996078431372549 pixel234:0.996078431372549 pixel235:0.996078431372549 pixel236:0.996078431372549 pixel237:0.996078431372549 pixel238:0.427450980392157 pixel239:0.325490196078431 pixel240:0.780392156862745 pixel241:0.996078431372549 pixel242:0.996078431372549 pixel243:0.996078431372549 
pixel244:0.996078431372549 pixel245:0.952941176470588 pixel246:0.333333333333333 pixel247:0 pixel248:0 pixel249:0 pixel250:0 pixel251:0 pixel252:0 pixel253:0 pixel254:0 pixel255:0 pixel256:0 pixel257:0 pixel258:0 pixel259:0.674509803921569 pixel260:0.996078431372549 pixel261:0.996078431372549 pixel262:0.996078431372549 pixel263:0.792156862745098 pixel264:0.576470588235294 pixel265:0.576470588235294 pixel266:0.176470588235294 pixel267:0 pixel268:0.0431372549019608 pixel269:0.113725490196078 pixel270:0.784313725490196 pixel271:0.996078431372549 pixel272:0.996078431372549 pixel273:0.996078431372549 pixel274:0.670588235294118 pixel275:0 pixel276:0 pixel277:0 pixel278:0 pixel279:0 pixel280:0 pixel281:0 pixel282:0 pixel283:0 pixel284:0 pixel285:0 pixel286:0.00392156862745098 pixel287:0.682352941176471 pixel288:0.996078431372549 pixel289:0.996078431372549 pixel290:0.349019607843137 pixel291:0.262745098039216 pixel292:0 pixel293:0 pixel294:0 pixel295:0 pixel296:0 pixel297:0 pixel298:0.501960784313725 pixel299:0.988235294117647 pixel300:0.996078431372549 pixel301:0.996078431372549 pixel302:0.831372549019608 pixel303:0.298039215686275 pixel304:0 pixel305:0 pixel306:0 pixel307:0 pixel308:0 pixel309:0 pixel310:0 pixel311:0 pixel312:0 pixel313:0 pixel314:0.184313725490196 pixel315:0.996078431372549 pixel316:0.996078431372549 pixel317:0.996078431372549 pixel318:0.113725490196078 pixel319:0 pixel320:0 pixel321:0 pixel322:0 pixel323:0 pixel324:0 pixel325:0 pixel326:0 pixel327:0.325490196078431 pixel328:0.996078431372549 pixel329:0.996078431372549 pixel330:0.996078431372549 pixel331:0.6 pixel332:0 pixel333:0 pixel334:0 pixel335:0 pixel336:0 pixel337:0 pixel338:0 pixel339:0 pixel340:0 pixel341:0 pixel342:0.313725490196078 pixel343:0.996078431372549 pixel344:0.996078431372549 pixel345:0.941176470588235 pixel346:0.0941176470588235 pixel347:0 pixel348:0 pixel349:0 pixel350:0 pixel351:0 pixel352:0 pixel353:0 pixel354:0 pixel355:0.0980392156862745 pixel356:0.941176470588235 pixel357:0.996078431372549 pixel358:0.996078431372549 pixel359:0.6 pixel360:0 pixel361:0 pixel362:0 pixel363:0 pixel364:0 pixel365:0 pixel366:0 pixel367:0 pixel368:0 pixel369:0 pixel370:0.250980392156863 pixel371:0.996078431372549 pixel372:0.996078431372549 pixel373:0.729411764705882 pixel374:0.0274509803921569 pixel375:0 pixel376:0 pixel377:0 pixel378:0 pixel379:0 pixel380:0 pixel381:0 pixel382:0 pixel383:0 pixel384:0.650980392156863 pixel385:0.996078431372549 pixel386:0.996078431372549 pixel387:0.87843137254902 pixel388:0.0470588235294118 pixel389:0 pixel390:0 pixel391:0 pixel392:0 pixel393:0 pixel394:0 pixel395:0 pixel396:0 pixel397:0.0549019607843137 pixel398:0.909803921568627 pixel399:0.996078431372549 pixel400:0.996078431372549 pixel401:0.996078431372549 pixel402:0.113725490196078 pixel403:0 pixel404:0 pixel405:0 pixel406:0 pixel407:0 pixel408:0 pixel409:0 pixel410:0 pixel411:0 pixel412:0.294117647058824 pixel413:0.996078431372549 pixel414:0.996078431372549 pixel415:0.996078431372549 pixel416:0.0666666666666667 pixel417:0 pixel418:0 pixel419:0 pixel420:0 pixel421:0 pixel422:0 pixel423:0 pixel424:0 pixel425:0.0705882352941176 pixel426:0.996078431372549 pixel427:0.996078431372549 pixel428:0.996078431372549 pixel429:0.996078431372549 pixel430:0.113725490196078 pixel431:0 pixel432:0 pixel433:0 pixel434:0 pixel435:0 pixel436:0 pixel437:0 pixel438:0 pixel439:0 pixel440:0.188235294117647 pixel441:0.996078431372549 pixel442:0.996078431372549 pixel443:0.996078431372549 pixel444:0.0666666666666667 pixel445:0 pixel446:0 pixel447:0 pixel448:0 
pixel449:0 pixel450:0 pixel451:0 pixel452:0 pixel453:0.00784313725490196 pixel454:0.63921568627451 pixel455:0.996078431372549 pixel456:0.996078431372549 pixel457:0.996078431372549 pixel458:0.113725490196078 pixel459:0 pixel460:0 pixel461:0 pixel462:0 pixel463:0 pixel464:0 pixel465:0 pixel466:0 pixel467:0 pixel468:0.188235294117647 pixel469:0.996078431372549 pixel470:0.996078431372549 pixel471:0.996078431372549 pixel472:0.0666666666666667 pixel473:0 pixel474:0 pixel475:0 pixel476:0 pixel477:0 pixel478:0 pixel479:0 pixel480:0 pixel481:0 pixel482:0.368627450980392 pixel483:0.996078431372549 pixel484:0.996078431372549 pixel485:0.996078431372549 pixel486:0.784313725490196 pixel487:0.0470588235294118 pixel488:0 pixel489:0 pixel490:0 pixel491:0 pixel492:0 pixel493:0 pixel494:0 pixel495:0.0627450980392157 pixel496:0.819607843137255 pixel497:0.996078431372549 pixel498:0.996078431372549 pixel499:0.588235294117647 pixel500:0.00392156862745098 pixel501:0 pixel502:0 pixel503:0 pixel504:0 pixel505:0 pixel506:0 pixel507:0 pixel508:0 pixel509:0 pixel510:0.0588235294117647 pixel511:0.807843137254902 pixel512:0.996078431372549 pixel513:0.996078431372549 pixel514:0.996078431372549 pixel515:0.792156862745098 pixel516:0.258823529411765 pixel517:0 pixel518:0 pixel519:0 pixel520:0 pixel521:0 pixel522:0.0823529411764706 pixel523:0.631372549019608 pixel524:0.996078431372549 pixel525:0.996078431372549 pixel526:0.96078431372549 pixel527:0.12156862745098 pixel528:0 pixel529:0 pixel530:0 pixel531:0 pixel532:0 pixel533:0 pixel534:0 pixel535:0 pixel536:0 pixel537:0 pixel538:0 pixel539:0.235294117647059 pixel540:0.831372549019608 pixel541:0.996078431372549 pixel542:0.996078431372549 pixel543:0.996078431372549 pixel544:0.76078431372549 pixel545:0.188235294117647 pixel546:0.188235294117647 pixel547:0.133333333333333 pixel548:0.16078431372549 pixel549:0.188235294117647 pixel550:0.819607843137255 pixel551:0.996078431372549 pixel552:0.996078431372549 pixel553:0.996078431372549 pixel554:0.670588235294118 pixel555:0 pixel556:0 pixel557:0 pixel558:0 pixel559:0 pixel560:0 pixel561:0 pixel562:0 pixel563:0 pixel564:0 pixel565:0 pixel566:0 pixel567:0 pixel568:0.337254901960784 pixel569:0.952941176470588 pixel570:0.996078431372549 pixel571:0.996078431372549 pixel572:0.996078431372549 pixel573:0.996078431372549 pixel574:0.996078431372549 pixel575:0.913725490196078 pixel576:0.952941176470588 pixel577:0.996078431372549 pixel578:0.996078431372549 pixel579:0.996078431372549 pixel580:0.996078431372549 pixel581:0.996078431372549 pixel582:0.337254901960784 pixel583:0 pixel584:0 pixel585:0 pixel586:0 pixel587:0 pixel588:0 pixel589:0 pixel590:0 pixel591:0 pixel592:0 pixel593:0 pixel594:0 pixel595:0 pixel596:0 pixel597:0.447058823529412 pixel598:0.996078431372549 pixel599:0.996078431372549 pixel600:0.996078431372549 pixel601:0.996078431372549 pixel602:0.996078431372549 pixel603:0.996078431372549 pixel604:0.996078431372549 pixel605:0.996078431372549 pixel606:0.996078431372549 pixel607:0.996078431372549 pixel608:0.937254901960784 pixel609:0.337254901960784 pixel610:0.0431372549019608 pixel611:0 pixel612:0 pixel613:0 pixel614:0 pixel615:0 pixel616:0 pixel617:0 pixel618:0 pixel619:0 pixel620:0 pixel621:0 pixel622:0 pixel623:0 pixel624:0 pixel625:0.0509803921568627 pixel626:0.713725490196078 pixel627:0.996078431372549 pixel628:0.996078431372549 pixel629:0.996078431372549 pixel630:0.996078431372549 pixel631:0.996078431372549 pixel632:0.996078431372549 pixel633:0.996078431372549 pixel634:0.996078431372549 pixel635:0.952941176470588 
pixel636:0.274509803921569 pixel637:0 pixel638:0 pixel639:0 pixel640:0 pixel641:0 pixel642:0 pixel643:0 pixel644:0 pixel645:0 pixel646:0 pixel647:0 pixel648:0 pixel649:0 pixel650:0 pixel651:0 pixel652:0 pixel653:0 pixel654:0.0313725490196078 pixel655:0.298039215686275 pixel656:0.572549019607843 pixel657:0.996078431372549 pixel658:1 pixel659:0.996078431372549 pixel660:1 pixel661:0.572549019607843 pixel662:0.0745098039215686 pixel663:0.0588235294117647 pixel664:0 pixel665:0 pixel666:0 pixel667:0 pixel668:0 pixel669:0 pixel670:0 pixel671:0 pixel672:0 pixel673:0 pixel674:0 pixel675:0 pixel676:0 pixel677:0 pixel678:0 pixel679:0 pixel680:0 pixel681:0 pixel682:0 pixel683:0 pixel684:0 pixel685:0 pixel686:0 pixel687:0 pixel688:0 pixel689:0 pixel690:0 pixel691:0 pixel692:0 pixel693:0 pixel694:0 pixel695:0 pixel696:0 pixel697:0 pixel698:0 pixel699:0 pixel700:0 pixel701:0 pixel702:0 pixel703:0 pixel704:0 pixel705:0 pixel706:0 pixel707:0 pixel708:0 pixel709:0 pixel710:0 pixel711:0 pixel712:0 pixel713:0 pixel714:0 pixel715:0 pixel716:0 pixel717:0 pixel718:0 pixel719:0 pixel720:0 pixel721:0 pixel722:0 pixel723:0 pixel724:0 pixel725:0 pixel726:0 pixel727:0 pixel728:0 pixel729:0 pixel730:0 pixel731:0 pixel732:0 pixel733:0 pixel734:0 pixel735:0 pixel736:0 pixel737:0 pixel738:0 pixel739:0 pixel740:0 pixel741:0 pixel742:0 pixel743:0 pixel744:0 pixel745:0 pixel746:0 pixel747:0 pixel748:0 pixel749:0 pixel750:0 pixel751:0 pixel752:0 pixel753:0 pixel754:0 pixel755:0 pixel756:0 pixel757:0 pixel758:0 pixel759:0 pixel760:0 pixel761:0 pixel762:0 pixel763:0 pixel764:0 pixel765:0 pixel766:0 pixel767:0 pixel768:0 pixel769:0 pixel770:0 pixel771:0 pixel772:0 pixel773:0 pixel774:0 pixel775:0 pixel776:0 pixel777:0 pixel778:0 pixel779:0 pixel780:0 pixel781:0 pixel782:0 pixel783:0 pixel784:0
1 |image pixel1:0 pixel2:0 pixel3:0 pixel4:0 pixel5:0 pixel6:0 pixel7:0 pixel8:0 pixel9:0 pixel10:0 pixel11:0 pixel12:0 pixel13:0 pixel14:0 pixel15:0 pixel16:0 pixel17:0 pixel18:0 pixel19:0 pixel20:0 pixel21:0 pixel22:0 pixel23:0 pixel24:0 pixel25:0 pixel26:0 pixel27:0 pixel28:0 pixel29:0 pixel30:0 pixel31:0 pixel32:0 pixel33:0 pixel34:0 pixel35:0 pixel36:0 pixel37:0 pixel38:0 pixel39:0 pixel40:0 pixel41:0 pixel42:0 pixel43:0 pixel44:0 pixel45:0 pixel46:0 pixel47:0 pixel48:0 pixel49:0 pixel50:0 pixel51:0 pixel52:0 pixel53:0 pixel54:0 pixel55:0 pixel56:0 pixel57:0 pixel58:0 pixel59:0 pixel60:0 pixel61:0 pixel62:0 pixel63:0 pixel64:0 pixel65:0 pixel66:0 pixel67:0 pixel68:0 pixel69:0 pixel70:0 pixel71:0 pixel72:0 pixel73:0 pixel74:0 pixel75:0 pixel76:0 pixel77:0 pixel78:0 pixel79:0 pixel80:0 pixel81:0 pixel82:0 pixel83:0 pixel84:0 pixel85:0 pixel86:0 pixel87:0 pixel88:0 pixel89:0 pixel90:0 pixel91:0 pixel92:0 pixel93:0 pixel94:0 pixel95:0 pixel96:0 pixel97:0 pixel98:0 pixel99:0 pixel100:0 pixel101:0 pixel102:0 pixel103:0 pixel104:0 pixel105:0 pixel106:0 pixel107:0 pixel108:0 pixel109:0 pixel110:0 pixel111:0 pixel112:0 pixel113:0 pixel114:0 pixel115:0 pixel116:0 pixel117:0 pixel118:0 pixel119:0 pixel120:0 pixel121:0 pixel122:0 pixel123:0 pixel124:0 pixel125:0.0117647058823529 pixel126:0.552941176470588 pixel127:0.545098039215686 pixel128:0.0117647058823529 pixel129:0 pixel130:0 pixel131:0 pixel132:0 pixel133:0 pixel134:0 pixel135:0 pixel136:0 pixel137:0 pixel138:0 pixel139:0 pixel140:0 pixel141:0 pixel142:0 pixel143:0 pixel144:0 pixel145:0 pixel146:0 pixel147:0 pixel148:0 pixel149:0 pixel150:0 pixel151:0 pixel152:0 pixel153:0.0352941176470588 pixel154:0.996078431372549 pixel155:0.996078431372549 pixel156:0.0313725490196078 pixel157:0 pixel158:0 pixel159:0 pixel160:0 pixel161:0 pixel162:0 pixel163:0 pixel164:0 pixel165:0 pixel166:0 pixel167:0 pixel168:0 pixel169:0 pixel170:0 pixel171:0 pixel172:0 pixel173:0 pixel174:0 pixel175:0 pixel176:0 pixel177:0 pixel178:0 pixel179:0 pixel180:0 pixel181:0.0352941176470588 pixel182:0.996078431372549 pixel183:0.996078431372549 pixel184:0.0313725490196078 pixel185:0 pixel186:0 pixel187:0 pixel188:0 pixel189:0 pixel190:0 pixel191:0 pixel192:0 pixel193:0 pixel194:0 pixel195:0 pixel196:0 pixel197:0 pixel198:0 pixel199:0 pixel200:0 pixel201:0 pixel202:0 pixel203:0 pixel204:0 pixel205:0 pixel206:0 pixel207:0 pixel208:0 pixel209:0.0352941176470588 pixel210:0.996078431372549 pixel211:0.996078431372549 pixel212:0.415686274509804 pixel213:0 pixel214:0 pixel215:0 pixel216:0 pixel217:0 pixel218:0 pixel219:0 pixel220:0 pixel221:0 pixel222:0 pixel223:0 pixel224:0 pixel225:0 pixel226:0 pixel227:0 pixel228:0 pixel229:0 pixel230:0 pixel231:0 pixel232:0 pixel233:0 pixel234:0 pixel235:0 pixel236:0 pixel237:0.0352941176470588 pixel238:0.996078431372549 pixel239:0.996078431372549 pixel240:0.72156862745098 pixel241:0 pixel242:0 pixel243:0 pixel244:0 pixel245:0 pixel246:0 pixel247:0 pixel248:0 pixel249:0 pixel250:0 pixel251:0 pixel252:0 pixel253:0 pixel254:0 pixel255:0 pixel256:0 pixel257:0 pixel258:0 pixel259:0 pixel260:0 pixel261:0 pixel262:0 pixel263:0 pixel264:0 pixel265:0.0352941176470588 pixel266:0.996078431372549 pixel267:0.996078431372549 pixel268:0.72156862745098 pixel269:0 pixel270:0 pixel271:0 pixel272:0 pixel273:0 pixel274:0 pixel275:0 pixel276:0 pixel277:0 pixel278:0 pixel279:0 pixel280:0 pixel281:0 pixel282:0 pixel283:0 pixel284:0 pixel285:0 pixel286:0 pixel287:0 pixel288:0 pixel289:0 pixel290:0 pixel291:0 pixel292:0 pixel293:0.0352941176470588 
pixel294:0.996078431372549 pixel295:0.996078431372549 pixel296:0.72156862745098 pixel297:0 pixel298:0 pixel299:0 pixel300:0 pixel301:0 pixel302:0 pixel303:0 pixel304:0 pixel305:0 pixel306:0 pixel307:0 pixel308:0 pixel309:0 pixel310:0 pixel311:0 pixel312:0 pixel313:0 pixel314:0 pixel315:0 pixel316:0 pixel317:0 pixel318:0 pixel319:0 pixel320:0 pixel321:0.0235294117647059 pixel322:0.725490196078431 pixel323:0.996078431372549 pixel324:0.72156862745098 pixel325:0 pixel326:0 pixel327:0 pixel328:0 pixel329:0 pixel330:0 pixel331:0 pixel332:0 pixel333:0 pixel334:0 pixel335:0 pixel336:0 pixel337:0 pixel338:0 pixel339:0 pixel340:0 pixel341:0 pixel342:0 pixel343:0 pixel344:0 pixel345:0 pixel346:0 pixel347:0 pixel348:0 pixel349:0 pixel350:0.349019607843137 pixel351:0.996078431372549 pixel352:0.72156862745098 pixel353:0 pixel354:0 pixel355:0 pixel356:0 pixel357:0 pixel358:0 pixel359:0 pixel360:0 pixel361:0 pixel362:0 pixel363:0 pixel364:0 pixel365:0 pixel366:0 pixel367:0 pixel368:0 pixel369:0 pixel370:0 pixel371:0 pixel372:0 pixel373:0 pixel374:0 pixel375:0 pixel376:0 pixel377:0.0156862745098039 pixel378:0.572549019607843 pixel379:0.996078431372549 pixel380:0.72156862745098 pixel381:0 pixel382:0 pixel383:0 pixel384:0 pixel385:0 pixel386:0 pixel387:0 pixel388:0 pixel389:0 pixel390:0 pixel391:0 pixel392:0 pixel393:0 pixel394:0 pixel395:0 pixel396:0 pixel397:0 pixel398:0 pixel399:0 pixel400:0 pixel401:0 pixel402:0 pixel403:0 pixel404:0 pixel405:0.0352941176470588 pixel406:0.996078431372549 pixel407:0.996078431372549 pixel408:0.72156862745098 pixel409:0 pixel410:0 pixel411:0 pixel412:0 pixel413:0 pixel414:0 pixel415:0 pixel416:0 pixel417:0 pixel418:0 pixel419:0 pixel420:0 pixel421:0 pixel422:0 pixel423:0 pixel424:0 pixel425:0 pixel426:0 pixel427:0 pixel428:0 pixel429:0 pixel430:0 pixel431:0 pixel432:0 pixel433:0.0352941176470588 pixel434:0.996078431372549 pixel435:0.996078431372549 pixel436:0.72156862745098 pixel437:0 pixel438:0 pixel439:0 pixel440:0 pixel441:0 pixel442:0 pixel443:0 pixel444:0 pixel445:0 pixel446:0 pixel447:0 pixel448:0 pixel449:0 pixel450:0 pixel451:0 pixel452:0 pixel453:0 pixel454:0 pixel455:0 pixel456:0 pixel457:0 pixel458:0 pixel459:0 pixel460:0 pixel461:0.0352941176470588 pixel462:0.996078431372549 pixel463:0.996078431372549 pixel464:0.72156862745098 pixel465:0 pixel466:0 pixel467:0 pixel468:0 pixel469:0 pixel470:0 pixel471:0 pixel472:0 pixel473:0 pixel474:0 pixel475:0 pixel476:0 pixel477:0 pixel478:0 pixel479:0 pixel480:0 pixel481:0 pixel482:0 pixel483:0 pixel484:0 pixel485:0 pixel486:0 pixel487:0 pixel488:0 pixel489:0.0352941176470588 pixel490:0.996078431372549 pixel491:0.996078431372549 pixel492:0.72156862745098 pixel493:0 pixel494:0 pixel495:0 pixel496:0 pixel497:0 pixel498:0 pixel499:0 pixel500:0 pixel501:0 pixel502:0 pixel503:0 pixel504:0 pixel505:0 pixel506:0 pixel507:0 pixel508:0 pixel509:0 pixel510:0 pixel511:0 pixel512:0 pixel513:0 pixel514:0 pixel515:0 pixel516:0 pixel517:0.0352941176470588 pixel518:0.996078431372549 pixel519:0.996078431372549 pixel520:0.72156862745098 pixel521:0 pixel522:0 pixel523:0 pixel524:0 pixel525:0 pixel526:0 pixel527:0 pixel528:0 pixel529:0 pixel530:0 pixel531:0 pixel532:0 pixel533:0 pixel534:0 pixel535:0 pixel536:0 pixel537:0 pixel538:0 pixel539:0 pixel540:0 pixel541:0 pixel542:0 pixel543:0 pixel544:0 pixel545:0.611764705882353 pixel546:0.996078431372549 pixel547:0.996078431372549 pixel548:0.72156862745098 pixel549:0 pixel550:0 pixel551:0 pixel552:0 pixel553:0 pixel554:0 pixel555:0 pixel556:0 pixel557:0 pixel558:0 pixel559:0 pixel560:0 pixel561:0 
pixel562:0 pixel563:0 pixel564:0 pixel565:0 pixel566:0 pixel567:0 pixel568:0 pixel569:0 pixel570:0 pixel571:0 pixel572:0 pixel573:0.725490196078431 pixel574:1 pixel575:1 pixel576:0.72156862745098 pixel577:0 pixel578:0 pixel579:0 pixel580:0 pixel581:0 pixel582:0 pixel583:0 pixel584:0 pixel585:0 pixel586:0 pixel587:0 pixel588:0 pixel589:0 pixel590:0 pixel591:0 pixel592:0 pixel593:0 pixel594:0 pixel595:0 pixel596:0 pixel597:0 pixel598:0 pixel599:0 pixel600:0 pixel601:0.725490196078431 pixel602:0.996078431372549 pixel603:0.996078431372549 pixel604:0.72156862745098 pixel605:0 pixel606:0 pixel607:0 pixel608:0 pixel609:0 pixel610:0 pixel611:0 pixel612:0 pixel613:0 pixel614:0 pixel615:0 pixel616:0 pixel617:0 pixel618:0 pixel619:0 pixel620:0 pixel621:0 pixel622:0 pixel623:0 pixel624:0 pixel625:0 pixel626:0 pixel627:0 pixel628:0 pixel629:0.725490196078431 pixel630:0.996078431372549 pixel631:0.996078431372549 pixel632:0.72156862745098 pixel633:0 pixel634:0 pixel635:0 pixel636:0 pixel637:0 pixel638:0 pixel639:0 pixel640:0 pixel641:0 pixel642:0 pixel643:0 pixel644:0 pixel645:0 pixel646:0 pixel647:0 pixel648:0 pixel649:0 pixel650:0 pixel651:0 pixel652:0 pixel653:0 pixel654:0 pixel655:0 pixel656:0 pixel657:0.247058823529412 pixel658:0.996078431372549 pixel659:0.996078431372549 pixel660:0.243137254901961 pixel661:0 pixel662:0 pixel663:0 pixel664:0 pixel665:0 pixel666:0 pixel667:0 pixel668:0 pixel669:0 pixel670:0 pixel671:0 pixel672:0 pixel673:0 pixel674:0 pixel675:0 pixel676:0 pixel677:0 pixel678:0 pixel679:0 pixel680:0 pixel681:0 pixel682:0 pixel683:0 pixel684:0 pixel685:0 pixel686:0 pixel687:0 pixel688:0 pixel689:0 pixel690:0 pixel691:0 pixel692:0 pixel693:0 pixel694:0 pixel695:0 pixel696:0 pixel697:0 pixel698:0 pixel699:0 pixel700:0 pixel701:0 pixel702:0 pixel703:0 pixel704:0 pixel705:0 pixel706:0 pixel707:0 pixel708:0 pixel709:0 pixel710:0 pixel711:0 pixel712:0 pixel713:0 pixel714:0 pixel715:0 pixel716:0 pixel717:0 pixel718:0 pixel719:0 pixel720:0 pixel721:0 pixel722:0 pixel723:0 pixel724:0 pixel725:0 pixel726:0 pixel727:0 pixel728:0 pixel729:0 pixel730:0 pixel731:0 pixel732:0 pixel733:0 pixel734:0 pixel735:0 pixel736:0 pixel737:0 pixel738:0 pixel739:0 pixel740:0 pixel741:0 pixel742:0 pixel743:0 pixel744:0 pixel745:0 pixel746:0 pixel747:0 pixel748:0 pixel749:0 pixel750:0 pixel751:0 pixel752:0 pixel753:0 pixel754:0 pixel755:0 pixel756:0 pixel757:0 pixel758:0 pixel759:0 pixel760:0 pixel761:0 pixel762:0 pixel763:0 pixel764:0 pixel765:0 pixel766:0 pixel767:0 pixel768:0 pixel769:0 pixel770:0 pixel771:0 pixel772:0 pixel773:0 pixel774:0 pixel775:0 pixel776:0 pixel777:0 pixel778:0 pixel779:0 pixel780:0 pixel781:0 pixel782:0 pixel783:0 pixel784:0
4 |image pixel1:0 pixel2:0 pixel3:0 pixel4:0 pixel5:0 pixel6:0 pixel7:0 pixel8:0 pixel9:0 pixel10:0 pixel11:0 pixel12:0 pixel13:0 pixel14:0 pixel15:0 pixel16:0 pixel17:0 pixel18:0 pixel19:0 pixel20:0 pixel21:0 pixel22:0 pixel23:0 pixel24:0 pixel25:0 pixel26:0 pixel27:0 pixel28:0 pixel29:0 pixel30:0 pixel31:0 pixel32:0 pixel33:0 pixel34:0 pixel35:0 pixel36:0 pixel37:0 pixel38:0 pixel39:0 pixel40:0 pixel41:0 pixel42:0 pixel43:0 pixel44:0 pixel45:0 pixel46:0 pixel47:0 pixel48:0 pixel49:0 pixel50:0 pixel51:0 pixel52:0 pixel53:0 pixel54:0 pixel55:0 pixel56:0 pixel57:0 pixel58:0 pixel59:0 pixel60:0 pixel61:0 pixel62:0 pixel63:0 pixel64:0 pixel65:0 pixel66:0 pixel67:0 pixel68:0 pixel69:0 pixel70:0 pixel71:0 pixel72:0 pixel73:0 pixel74:0 pixel75:0 pixel76:0 pixel77:0 pixel78:0 pixel79:0 pixel80:0 pixel81:0 pixel82:0 pixel83:0 pixel84:0 pixel85:0 pixel86:0 pixel87:0 pixel88:0 pixel89:0 pixel90:0 pixel91:0 pixel92:0 pixel93:0 pixel94:0 pixel95:0 pixel96:0 pixel97:0 pixel98:0 pixel99:0 pixel100:0 pixel101:0 pixel102:0 pixel103:0 pixel104:0 pixel105:0 pixel106:0 pixel107:0 pixel108:0 pixel109:0 pixel110:0 pixel111:0 pixel112:0 pixel113:0 pixel114:0 pixel115:0 pixel116:0 pixel117:0 pixel118:0 pixel119:0 pixel120:0 pixel121:0 pixel122:0 pixel123:0 pixel124:0 pixel125:0 pixel126:0 pixel127:0 pixel128:0 pixel129:0 pixel130:0 pixel131:0 pixel132:0 pixel133:0 pixel134:0 pixel135:0 pixel136:0 pixel137:0 pixel138:0 pixel139:0 pixel140:0 pixel141:0 pixel142:0 pixel143:0 pixel144:0 pixel145:0 pixel146:0 pixel147:0.862745098039216 pixel148:0.701960784313725 pixel149:0.0235294117647059 pixel150:0 pixel151:0 pixel152:0 pixel153:0 pixel154:0 pixel155:0 pixel156:0 pixel157:0 pixel158:0.0352941176470588 pixel159:0.301960784313725 pixel160:0 pixel161:0 pixel162:0 pixel163:0 pixel164:0 pixel165:0 pixel166:0 pixel167:0 pixel168:0 pixel169:0 pixel170:0 pixel171:0 pixel172:0 pixel173:0 pixel174:0 pixel175:0.109803921568627 pixel176:0.968627450980392 pixel177:0.0666666666666667 pixel178:0 pixel179:0 pixel180:0 pixel181:0 pixel182:0 pixel183:0 pixel184:0 pixel185:0 pixel186:0.105882352941176 pixel187:0.792156862745098 pixel188:0 pixel189:0 pixel190:0 pixel191:0 pixel192:0 pixel193:0 pixel194:0 pixel195:0 pixel196:0 pixel197:0 pixel198:0 pixel199:0 pixel200:0 pixel201:0 pixel202:0 pixel203:0 pixel204:0.949019607843137 pixel205:0.607843137254902 pixel206:0 pixel207:0 pixel208:0 pixel209:0 pixel210:0 pixel211:0 pixel212:0 pixel213:0 pixel214:0.105882352941176 pixel215:0.996078431372549 pixel216:0.247058823529412 pixel217:0 pixel218:0 pixel219:0 pixel220:0 pixel221:0 pixel222:0 pixel223:0 pixel224:0 pixel225:0 pixel226:0 pixel227:0 pixel228:0 pixel229:0 pixel230:0 pixel231:0 pixel232:0.627450980392157 pixel233:0.811764705882353 pixel234:0.0235294117647059 pixel235:0 pixel236:0 pixel237:0 pixel238:0 pixel239:0 pixel240:0 pixel241:0 pixel242:0.105882352941176 pixel243:0.996078431372549 pixel244:0.254901960784314 pixel245:0 pixel246:0 pixel247:0 pixel248:0 pixel249:0 pixel250:0 pixel251:0 pixel252:0 pixel253:0 pixel254:0 pixel255:0 pixel256:0 pixel257:0 pixel258:0 pixel259:0 pixel260:0.498039215686275 pixel261:0.996078431372549 pixel262:0.0823529411764706 pixel263:0 pixel264:0 pixel265:0 pixel266:0 pixel267:0 pixel268:0 pixel269:0 pixel270:0.0784313725490196 pixel271:0.937254901960784 pixel272:0.254901960784314 pixel273:0 pixel274:0 pixel275:0 pixel276:0 pixel277:0 pixel278:0 pixel279:0 pixel280:0 pixel281:0 pixel282:0 pixel283:0 pixel284:0 pixel285:0 pixel286:0 pixel287:0 pixel288:0.301960784313725 pixel289:0.996078431372549 
pixel290:0.0823529411764706 pixel291:0 pixel292:0 pixel293:0 pixel294:0 pixel295:0 pixel296:0 pixel297:0 pixel298:0 pixel299:0.764705882352941 pixel300:0.254901960784314 pixel301:0 pixel302:0 pixel303:0 pixel304:0 pixel305:0 pixel306:0 pixel307:0 pixel308:0 pixel309:0 pixel310:0 pixel311:0 pixel312:0 pixel313:0 pixel314:0 pixel315:0 pixel316:0.274509803921569 pixel317:0.996078431372549 pixel318:0.0823529411764706 pixel319:0 pixel320:0 pixel321:0 pixel322:0 pixel323:0 pixel324:0 pixel325:0 pixel326:0 pixel327:0.764705882352941 pixel328:0.556862745098039 pixel329:0 pixel330:0 pixel331:0 pixel332:0 pixel333:0 pixel334:0 pixel335:0 pixel336:0 pixel337:0 pixel338:0 pixel339:0 pixel340:0 pixel341:0 pixel342:0 pixel343:0 pixel344:0.219607843137255 pixel345:0.984313725490196 pixel346:0.0823529411764706 pixel347:0 pixel348:0 pixel349:0 pixel350:0 pixel351:0 pixel352:0 pixel353:0 pixel354:0 pixel355:0.764705882352941 pixel356:0.890196078431372 pixel357:0 pixel358:0 pixel359:0 pixel360:0 pixel361:0 pixel362:0 pixel363:0 pixel364:0 pixel365:0 pixel366:0 pixel367:0 pixel368:0 pixel369:0 pixel370:0 pixel371:0 pixel372:0 pixel373:0.870588235294118 pixel374:0.6 pixel375:0.0196078431372549 pixel376:0 pixel377:0 pixel378:0 pixel379:0 pixel380:0 pixel381:0 pixel382:0 pixel383:0.470588235294118 pixel384:0.941176470588235 pixel385:0.0509803921568627 pixel386:0 pixel387:0 pixel388:0 pixel389:0 pixel390:0 pixel391:0 pixel392:0 pixel393:0 pixel394:0 pixel395:0 pixel396:0 pixel397:0 pixel398:0 pixel399:0 pixel400:0 pixel401:0.262745098039216 pixel402:0.984313725490196 pixel403:0.156862745098039 pixel404:0 pixel405:0 pixel406:0 pixel407:0 pixel408:0 pixel409:0 pixel410:0 pixel411:0.368627450980392 pixel412:1 pixel413:0.270588235294118 pixel414:0 pixel415:0 pixel416:0 pixel417:0 pixel418:0 pixel419:0 pixel420:0 pixel421:0 pixel422:0 pixel423:0 pixel424:0 pixel425:0 pixel426:0 pixel427:0 pixel428:0 pixel429:0 pixel430:0.917647058823529 pixel431:0.72156862745098 pixel432:0 pixel433:0 pixel434:0 pixel435:0 pixel436:0 pixel437:0 pixel438:0 pixel439:0.0745098039215686 pixel440:0.96078431372549 pixel441:0.270588235294118 pixel442:0 pixel443:0 pixel444:0 pixel445:0 pixel446:0 pixel447:0 pixel448:0 pixel449:0 pixel450:0 pixel451:0 pixel452:0 pixel453:0 pixel454:0 pixel455:0 pixel456:0 pixel457:0 pixel458:0.917647058823529 pixel459:0.662745098039216 pixel460:0 pixel461:0 pixel462:0 pixel463:0 pixel464:0 pixel465:0 pixel466:0 pixel467:0.0117647058823529 pixel468:0.780392156862745 pixel469:0.713725490196078 pixel470:0.0392156862745098 pixel471:0 pixel472:0 pixel473:0 pixel474:0 pixel475:0 pixel476:0 pixel477:0 pixel478:0 pixel479:0 pixel480:0 pixel481:0 pixel482:0 pixel483:0 pixel484:0 pixel485:0 pixel486:0.603921568627451 pixel487:0.803921568627451 pixel488:0.0156862745098039 pixel489:0 pixel490:0 pixel491:0.101960784313725 pixel492:0.282352941176471 pixel493:0.501960784313725 pixel494:0.796078431372549 pixel495:0.815686274509804 pixel496:0.996078431372549 pixel497:0.996078431372549 pixel498:0.513725490196078 pixel499:0 pixel500:0 pixel501:0 pixel502:0 pixel503:0 pixel504:0 pixel505:0 pixel506:0 pixel507:0 pixel508:0 pixel509:0 pixel510:0 pixel511:0 pixel512:0 pixel513:0 pixel514:0.23921568627451 pixel515:0.996078431372549 pixel516:0.505882352941176 pixel517:0.443137254901961 pixel518:0.729411764705882 pixel519:0.96078431372549 pixel520:0.984313725490196 pixel521:0.741176470588235 pixel522:0.294117647058824 pixel523:0.219607843137255 pixel524:0.533333333333333 pixel525:0.996078431372549 pixel526:0.286274509803922 pixel527:0 
pixel528:0 pixel529:0 pixel530:0 pixel531:0 pixel532:0 pixel533:0 pixel534:0 pixel535:0 pixel536:0 pixel537:0 pixel538:0 pixel539:0 pixel540:0 pixel541:0 pixel542:0.0588235294117647 pixel543:0.847058823529412 pixel544:0.913725490196078 pixel545:0.913725490196078 pixel546:0.623529411764706 pixel547:0.407843137254902 pixel548:0.203921568627451 pixel549:0 pixel550:0 pixel551:0 pixel552:0.149019607843137 pixel553:0.996078431372549 pixel554:0.286274509803922 pixel555:0 pixel556:0 pixel557:0 pixel558:0 pixel559:0 pixel560:0 pixel561:0 pixel562:0 pixel563:0 pixel564:0 pixel565:0 pixel566:0 pixel567:0 pixel568:0 pixel569:0 pixel570:0 pixel571:0 pixel572:0 pixel573:0 pixel574:0 pixel575:0 pixel576:0 pixel577:0 pixel578:0 pixel579:0 pixel580:0.0705882352941176 pixel581:0.996078431372549 pixel582:0.286274509803922 pixel583:0 pixel584:0 pixel585:0 pixel586:0 pixel587:0 pixel588:0 pixel589:0 pixel590:0 pixel591:0 pixel592:0 pixel593:0 pixel594:0 pixel595:0 pixel596:0 pixel597:0 pixel598:0 pixel599:0 pixel600:0 pixel601:0 pixel602:0 pixel603:0 pixel604:0 pixel605:0 pixel606:0 pixel607:0 pixel608:0.0705882352941176 pixel609:0.996078431372549 pixel610:0.286274509803922 pixel611:0 pixel612:0 pixel613:0 pixel614:0 pixel615:0 pixel616:0 pixel617:0 pixel618:0 pixel619:0 pixel620:0 pixel621:0 pixel622:0 pixel623:0 pixel624:0 pixel625:0 pixel626:0 pixel627:0 pixel628:0 pixel629:0 pixel630:0 pixel631:0 pixel632:0 pixel633:0 pixel634:0 pixel635:0 pixel636:0.0196078431372549 pixel637:0.807843137254902 pixel638:0.415686274509804 pixel639:0 pixel640:0 pixel641:0 pixel642:0 pixel643:0 pixel644:0 pixel645:0 pixel646:0 pixel647:0 pixel648:0 pixel649:0 pixel650:0 pixel651:0 pixel652:0 pixel653:0 pixel654:0 pixel655:0 pixel656:0 pixel657:0 pixel658:0 pixel659:0 pixel660:0 pixel661:0 pixel662:0 pixel663:0 pixel664:0 pixel665:0.729411764705882 pixel666:0.623529411764706 pixel667:0 pixel668:0 pixel669:0 pixel670:0 pixel671:0 pixel672:0 pixel673:0 pixel674:0 pixel675:0 pixel676:0 pixel677:0 pixel678:0 pixel679:0 pixel680:0 pixel681:0 pixel682:0 pixel683:0 pixel684:0 pixel685:0 pixel686:0 pixel687:0 pixel688:0 pixel689:0 pixel690:0 pixel691:0 pixel692:0.0235294117647059 pixel693:0.819607843137255 pixel694:0.396078431372549 pixel695:0 pixel696:0 pixel697:0 pixel698:0 pixel699:0 pixel700:0 pixel701:0 pixel702:0 pixel703:0 pixel704:0 pixel705:0 pixel706:0 pixel707:0 pixel708:0 pixel709:0 pixel710:0 pixel711:0 pixel712:0 pixel713:0 pixel714:0 pixel715:0 pixel716:0 pixel717:0 pixel718:0 pixel719:0 pixel720:0 pixel721:0 pixel722:0 pixel723:0 pixel724:0 pixel725:0 pixel726:0 pixel727:0 pixel728:0 pixel729:0 pixel730:0 pixel731:0 pixel732:0 pixel733:0 pixel734:0 pixel735:0 pixel736:0 pixel737:0 pixel738:0 pixel739:0 pixel740:0 pixel741:0 pixel742:0 pixel743:0 pixel744:0 pixel745:0 pixel746:0 pixel747:0 pixel748:0 pixel749:0 pixel750:0 pixel751:0 pixel752:0 pixel753:0 pixel754:0 pixel755:0 pixel756:0 pixel757:0 pixel758:0 pixel759:0 pixel760:0 pixel761:0 pixel762:0 pixel763:0 pixel764:0 pixel765:0 pixel766:0 pixel767:0 pixel768:0 pixel769:0 pixel770:0 pixel771:0 pixel772:0 pixel773:0 pixel774:0 pixel775:0 pixel776:0 pixel777:0 pixel778:0 pixel779:0 pixel780:0 pixel781:0 pixel782:0 pixel783:0 pixel784:0
10 |image pixel1:0 pixel2:0 pixel3:0 pixel4:0 pixel5:0 pixel6:0 pixel7:0 pixel8:0 pixel9:0 pixel10:0 pixel11:0 pixel12:0 pixel13:0 pixel14:0 pixel15:0 pixel16:0 pixel17:0 pixel18:0 pixel19:0 pixel20:0 pixel21:0 pixel22:0 pixel23:0 pixel24:0 pixel25:0 pixel26:0 pixel27:0 pixel28:0 pixel29:0 pixel30:0 pixel31:0 pixel32:0 pixel33:0 pixel34:0 pixel35:0 pixel36:0 pixel37:0 pixel38:0 pixel39:0 pixel40:0 pixel41:0 pixel42:0 pixel43:0 pixel44:0 pixel45:0 pixel46:0 pixel47:0 pixel48:0 pixel49:0 pixel50:0 pixel51:0 pixel52:0 pixel53:0 pixel54:0 pixel55:0 pixel56:0 pixel57:0 pixel58:0 pixel59:0 pixel60:0 pixel61:0 pixel62:0 pixel63:0 pixel64:0 pixel65:0 pixel66:0 pixel67:0 pixel68:0 pixel69:0 pixel70:0 pixel71:0 pixel72:0 pixel73:0 pixel74:0 pixel75:0 pixel76:0 pixel77:0 pixel78:0 pixel79:0 pixel80:0 pixel81:0 pixel82:0 pixel83:0 pixel84:0 pixel85:0 pixel86:0 pixel87:0 pixel88:0 pixel89:0 pixel90:0 pixel91:0 pixel92:0 pixel93:0 pixel94:0 pixel95:0 pixel96:0 pixel97:0 pixel98:0 pixel99:0 pixel100:0 pixel101:0 pixel102:0 pixel103:0 pixel104:0 pixel105:0 pixel106:0 pixel107:0 pixel108:0 pixel109:0 pixel110:0 pixel111:0 pixel112:0 pixel113:0 pixel114:0 pixel115:0 pixel116:0 pixel117:0 pixel118:0 pixel119:0 pixel120:0 pixel121:0 pixel122:0.00392156862745098 pixel123:0.0980392156862745 pixel124:0.509803921568627 pixel125:0.607843137254902 pixel126:0.996078431372549 pixel127:0.996078431372549 pixel128:0.996078431372549 pixel129:0.615686274509804 pixel130:0.117647058823529 pixel131:0.00784313725490196 pixel132:0 pixel133:0 pixel134:0 pixel135:0 pixel136:0 pixel137:0 pixel138:0 pixel139:0 pixel140:0 pixel141:0 pixel142:0 pixel143:0 pixel144:0 pixel145:0 pixel146:0 pixel147:0 pixel148:0 pixel149:0.0313725490196078 pixel150:0.403921568627451 pixel151:0.992156862745098 pixel152:0.992156862745098 pixel153:0.992156862745098 pixel154:0.992156862745098 pixel155:0.992156862745098 pixel156:0.992156862745098 pixel157:0.992156862745098 pixel158:0.992156862745098 pixel159:0.447058823529412 pixel160:0.00784313725490196 pixel161:0 pixel162:0 pixel163:0 pixel164:0 pixel165:0 pixel166:0 pixel167:0 pixel168:0 pixel169:0 pixel170:0 pixel171:0 pixel172:0 pixel173:0 pixel174:0 pixel175:0 pixel176:0.0431372549019608 pixel177:0.815686274509804 pixel178:0.992156862745098 pixel179:0.992156862745098 pixel180:0.992156862745098 pixel181:0.992156862745098 pixel182:0.992156862745098 pixel183:0.992156862745098 pixel184:0.992156862745098 pixel185:0.992156862745098 pixel186:0.992156862745098 pixel187:0.992156862745098 pixel188:0.419607843137255 pixel189:0 pixel190:0 pixel191:0 pixel192:0 pixel193:0 pixel194:0 pixel195:0 pixel196:0 pixel197:0 pixel198:0 pixel199:0 pixel200:0 pixel201:0 pixel202:0 pixel203:0 pixel204:0.12156862745098 pixel205:0.992156862745098 pixel206:0.992156862745098 pixel207:0.992156862745098 pixel208:0.992156862745098 pixel209:0.992156862745098 pixel210:0.992156862745098 pixel211:0.992156862745098 pixel212:0.992156862745098 pixel213:0.992156862745098 pixel214:0.992156862745098 pixel215:0.992156862745098 pixel216:0.843137254901961 pixel217:0.396078431372549 pixel218:0.0117647058823529 pixel219:0 pixel220:0 pixel221:0 pixel222:0 pixel223:0 pixel224:0 pixel225:0 pixel226:0 pixel227:0 pixel228:0 pixel229:0 pixel230:0 pixel231:0.0901960784313725 pixel232:0.823529411764706 pixel233:0.992156862745098 pixel234:0.992156862745098 pixel235:0.992156862745098 pixel236:0.972549019607843 pixel237:0.631372549019608 pixel238:0.870588235294118 pixel239:0.870588235294118 pixel240:0.964705882352941 pixel241:0.992156862745098 
pixel242:0.992156862745098 pixel243:0.992156862745098 pixel244:0.992156862745098 pixel245:0.992156862745098 pixel246:0.152941176470588 pixel247:0 pixel248:0 pixel249:0 pixel250:0 pixel251:0 pixel252:0 pixel253:0 pixel254:0 pixel255:0 pixel256:0 pixel257:0 pixel258:0 pixel259:0.533333333333333 pixel260:0.992156862745098 pixel261:0.992156862745098 pixel262:0.992156862745098 pixel263:0.898039215686275 pixel264:0.301960784313725 pixel265:0 pixel266:0 pixel267:0 pixel268:0.274509803921569 pixel269:0.854901960784314 pixel270:0.992156862745098 pixel271:0.992156862745098 pixel272:0.992156862745098 pixel273:0.992156862745098 pixel274:0.843137254901961 pixel275:0.356862745098039 pixel276:0 pixel277:0 pixel278:0 pixel279:0 pixel280:0 pixel281:0 pixel282:0 pixel283:0 pixel284:0 pixel285:0 pixel286:0.0196078431372549 pixel287:0.83921568627451 pixel288:0.992156862745098 pixel289:0.992156862745098 pixel290:0.992156862745098 pixel291:0.764705882352941 pixel292:0 pixel293:0 pixel294:0 pixel295:0 pixel296:0 pixel297:0.407843137254902 pixel298:0.87843137254902 pixel299:0.992156862745098 pixel300:0.992156862745098 pixel301:0.992156862745098 pixel302:0.992156862745098 pixel303:0.843137254901961 pixel304:0.113725490196078 pixel305:0 pixel306:0 pixel307:0 pixel308:0 pixel309:0 pixel310:0 pixel311:0 pixel312:0 pixel313:0 pixel314:0.454901960784314 pixel315:0.992156862745098 pixel316:0.992156862745098 pixel317:0.992156862745098 pixel318:0.968627450980392 pixel319:0.294117647058824 pixel320:0 pixel321:0 pixel322:0 pixel323:0 pixel324:0 pixel325:0 pixel326:0.101960784313725 pixel327:0.784313725490196 pixel328:0.992156862745098 pixel329:0.992156862745098 pixel330:0.992156862745098 pixel331:0.992156862745098 pixel332:0.847058823529412 pixel333:0.0156862745098039 pixel334:0 pixel335:0 pixel336:0 pixel337:0 pixel338:0 pixel339:0 pixel340:0 pixel341:0 pixel342:0.996078431372549 pixel343:0.992156862745098 pixel344:0.992156862745098 pixel345:0.992156862745098 pixel346:0.764705882352941 pixel347:0 pixel348:0 pixel349:0 pixel350:0 pixel351:0 pixel352:0 pixel353:0 pixel354:0 pixel355:0.101960784313725 pixel356:0.784313725490196 pixel357:0.992156862745098 pixel358:0.992156862745098 pixel359:0.992156862745098 pixel360:0.992156862745098 pixel361:0.0196078431372549 pixel362:0 pixel363:0 pixel364:0 pixel365:0 pixel366:0 pixel367:0 pixel368:0 pixel369:0 pixel370:0.996078431372549 pixel371:0.992156862745098 pixel372:0.992156862745098 pixel373:0.992156862745098 pixel374:0.388235294117647 pixel375:0 pixel376:0 pixel377:0 pixel378:0 pixel379:0 pixel380:0 pixel381:0 pixel382:0 pixel383:0 pixel384:0.0980392156862745 pixel385:0.905882352941176 pixel386:0.992156862745098 pixel387:0.992156862745098 pixel388:0.992156862745098 pixel389:0.141176470588235 pixel390:0 pixel391:0 pixel392:0 pixel393:0 pixel394:0 pixel395:0 pixel396:0 pixel397:0 pixel398:0.996078431372549 pixel399:0.992156862745098 pixel400:0.992156862745098 pixel401:0.992156862745098 pixel402:0.388235294117647 pixel403:0 pixel404:0 pixel405:0 pixel406:0 pixel407:0 pixel408:0 pixel409:0 pixel410:0 pixel411:0 pixel412:0 pixel413:0.874509803921569 pixel414:0.992156862745098 pixel415:0.992156862745098 pixel416:0.992156862745098 pixel417:0.505882352941176 pixel418:0 pixel419:0 pixel420:0 pixel421:0 pixel422:0 pixel423:0 pixel424:0 pixel425:0 pixel426:0.996078431372549 pixel427:0.992156862745098 pixel428:0.992156862745098 pixel429:0.992156862745098 pixel430:0.388235294117647 pixel431:0 pixel432:0 pixel433:0 pixel434:0 pixel435:0 pixel436:0 pixel437:0 pixel438:0 pixel439:0 pixel440:0 
pixel441:0.498039215686275 pixel442:0.992156862745098 pixel443:0.992156862745098 pixel444:0.992156862745098 pixel445:0.505882352941176 pixel446:0 pixel447:0 pixel448:0 pixel449:0 pixel450:0 pixel451:0 pixel452:0 pixel453:0 pixel454:0.996078431372549 pixel455:0.992156862745098 pixel456:0.992156862745098 pixel457:0.992156862745098 pixel458:0.388235294117647 pixel459:0 pixel460:0 pixel461:0 pixel462:0 pixel463:0 pixel464:0 pixel465:0 pixel466:0 pixel467:0 pixel468:0 pixel469:0.545098039215686 pixel470:0.992156862745098 pixel471:0.992156862745098 pixel472:0.992156862745098 pixel473:0.352941176470588 pixel474:0 pixel475:0 pixel476:0 pixel477:0 pixel478:0 pixel479:0 pixel480:0 pixel481:0 pixel482:0.996078431372549 pixel483:0.992156862745098 pixel484:0.992156862745098 pixel485:0.992156862745098 pixel486:0.388235294117647 pixel487:0 pixel488:0 pixel489:0 pixel490:0 pixel491:0 pixel492:0 pixel493:0 pixel494:0 pixel495:0 pixel496:0.305882352941176 pixel497:0.972549019607843 pixel498:0.992156862745098 pixel499:0.992156862745098 pixel500:0.992156862745098 pixel501:0.0196078431372549 pixel502:0 pixel503:0 pixel504:0 pixel505:0 pixel506:0 pixel507:0 pixel508:0 pixel509:0 pixel510:0.996078431372549 pixel511:0.992156862745098 pixel512:0.992156862745098 pixel513:0.992156862745098 pixel514:0.847058823529412 pixel515:0.133333333333333 pixel516:0 pixel517:0 pixel518:0 pixel519:0 pixel520:0 pixel521:0 pixel522:0 pixel523:0.129411764705882 pixel524:0.596078431372549 pixel525:0.992156862745098 pixel526:0.992156862745098 pixel527:0.992156862745098 pixel528:0.419607843137255 pixel529:0.00392156862745098 pixel530:0 pixel531:0 pixel532:0 pixel533:0 pixel534:0 pixel535:0 pixel536:0 pixel537:0 pixel538:0.807843137254902 pixel539:0.992156862745098 pixel540:0.992156862745098 pixel541:0.992156862745098 pixel542:0.992156862745098 pixel543:0.549019607843137 pixel544:0 pixel545:0 pixel546:0 pixel547:0 pixel548:0 pixel549:0.117647058823529 pixel550:0.545098039215686 pixel551:0.917647058823529 pixel552:0.992156862745098 pixel553:0.992156862745098 pixel554:0.992156862745098 pixel555:0.603921568627451 pixel556:0.00784313725490196 pixel557:0 pixel558:0 pixel559:0 pixel560:0 pixel561:0 pixel562:0 pixel563:0 pixel564:0 pixel565:0 pixel566:0.0627450980392157 pixel567:0.803921568627451 pixel568:0.992156862745098 pixel569:0.992156862745098 pixel570:0.992156862745098 pixel571:0.980392156862745 pixel572:0.815686274509804 pixel573:0.415686274509804 pixel574:0.415686274509804 pixel575:0.415686274509804 pixel576:0.784313725490196 pixel577:0.929411764705882 pixel578:0.992156862745098 pixel579:0.992156862745098 pixel580:0.992156862745098 pixel581:0.992156862745098 pixel582:0.819607843137255 pixel583:0.0862745098039216 pixel584:0 pixel585:0 pixel586:0 pixel587:0 pixel588:0 pixel589:0 pixel590:0 pixel591:0 pixel592:0 pixel593:0 pixel594:0 pixel595:0.32156862745098 pixel596:0.992156862745098 pixel597:0.992156862745098 pixel598:0.992156862745098 pixel599:0.992156862745098 pixel600:0.992156862745098 pixel601:0.992156862745098 pixel602:0.992156862745098 pixel603:0.992156862745098 pixel604:0.992156862745098 pixel605:0.992156862745098 pixel606:0.992156862745098 pixel607:0.992156862745098 pixel608:0.992156862745098 pixel609:0.819607843137255 pixel610:0.0862745098039216 pixel611:0 pixel612:0 pixel613:0 pixel614:0 pixel615:0 pixel616:0 pixel617:0 pixel618:0 pixel619:0 pixel620:0 pixel621:0 pixel622:0 pixel623:0.00392156862745098 pixel624:0.356862745098039 pixel625:0.992156862745098 pixel626:0.992156862745098 pixel627:0.992156862745098 
pixel628:0.992156862745098 pixel629:0.992156862745098 pixel630:0.992156862745098 pixel631:0.992156862745098 pixel632:0.992156862745098 pixel633:0.992156862745098 pixel634:0.992156862745098 pixel635:0.835294117647059 pixel636:0.352941176470588 pixel637:0.0274509803921569 pixel638:0 pixel639:0 pixel640:0 pixel641:0 pixel642:0 pixel643:0 pixel644:0 pixel645:0 pixel646:0 pixel647:0 pixel648:0 pixel649:0 pixel650:0 pixel651:0 pixel652:0.00392156862745098 pixel653:0.0705882352941176 pixel654:0.505882352941176 pixel655:0.815686274509804 pixel656:0.992156862745098 pixel657:0.992156862745098 pixel658:0.992156862745098 pixel659:0.992156862745098 pixel660:0.623529411764706 pixel661:0.505882352941176 pixel662:0.352941176470588 pixel663:0.0156862745098039 pixel664:0 pixel665:0 pixel666:0 pixel667:0 pixel668:0 pixel669:0 pixel670:0 pixel671:0 pixel672:0 pixel673:0 pixel674:0 pixel675:0 pixel676:0 pixel677:0 pixel678:0 pixel679:0 pixel680:0 pixel681:0 pixel682:0 pixel683:0 pixel684:0 pixel685:0 pixel686:0 pixel687:0 pixel688:0 pixel689:0 pixel690:0 pixel691:0 pixel692:0 pixel693:0 pixel694:0 pixel695:0 pixel696:0 pixel697:0 pixel698:0 pixel699:0 pixel700:0 pixel701:0 pixel702:0 pixel703:0 pixel704:0 pixel705:0 pixel706:0 pixel707:0 pixel708:0 pixel709:0 pixel710:0 pixel711:0 pixel712:0 pixel713:0 pixel714:0 pixel715:0 pixel716:0 pixel717:0 pixel718:0 pixel719:0 pixel720:0 pixel721:0 pixel722:0 pixel723:0 pixel724:0 pixel725:0 pixel726:0 pixel727:0 pixel728:0 pixel729:0 pixel730:0 pixel731:0 pixel732:0 pixel733:0 pixel734:0 pixel735:0 pixel736:0 pixel737:0 pixel738:0 pixel739:0 pixel740:0 pixel741:0 pixel742:0 pixel743:0 pixel744:0 pixel745:0 pixel746:0 pixel747:0 pixel748:0 pixel749:0 pixel750:0 pixel751:0 pixel752:0 pixel753:0 pixel754:0 pixel755:0 pixel756:0 pixel757:0 pixel758:0 pixel759:0 pixel760:0 pixel761:0 pixel762:0 pixel763:0 pixel764:0 pixel765:0 pixel766:0 pixel767:0 pixel768:0 pixel769:0 pixel770:0 pixel771:0 pixel772:0 pixel773:0 pixel774:0 pixel775:0 pixel776:0 pixel777:0 pixel778:0 pixel779:0 pixel780:0 pixel781:0 pixel782:0 pixel783:0 pixel784:0

The format is simple (see below). ‘labelValue’ is an integer in [1,10]; we write digit 0 as 10. A namespace is just a name for a collection of features. In this particular case all features are of the same kind (pixel intensities), so they are all clubbed under one namespace. You may read about the distinction between namespace and features here. I may mention in passing that a namespace is identified by the first letter of its name (‘i’ in our case) rather than by the full name (‘image’).


labelValue |namespaceName var1:value1 var2:value2 var3:value3 var4:value4 var5:value5
 

We therefore need to convert the ‘train.csv’ and ‘test.csv’ files to VW format. Even though test.csv does not contain any digit label, we arbitrarily assign it a label of 1; Vowpal Wabbit ignores this label while making predictions. The conversion script in R is below. I have taken the code from here and made minor changes.


# Data conversion to VW format
# ============================

# Read train.csv
data<-read.csv("train.csv",header=TRUE)

# Read just label values
y<-data[,1]
# Label 0 is written as label 10 (VW's requirement)
y[y[]==0]<-10

# Scale rest of data between [0,1]
x<-(data[,-1])/255

# Indices where, in row 1, x is not zero
#   (just to have a look at the data)
which(x[1,]>0)
x[,which(x[1,]>0)]

# Function to convert a csv file to VW format
write_vwformat = function( filename, y, x ) {
     # Open file where to write output to
     f = file( filename, 'w' )
     # loop over all rows
     for ( i in 1:nrow( x )) {
         # Generate a vector of all column index numbers
         indexes = 1:ncol(x)
         # Values at those index numbers, stored in a vector
         values = x[i, indexes]
         # Create a prefix for each column name: pixel1, pixel2, ...
         prefix = paste("pixel", indexes, sep="")
         # Concatenate each prefix with its column value: "pixel1:0 pixel2:0.987 ..."
         iv_pairs = paste( prefix, values, sep = ":", collapse = " " )
         # add label in the front and newline at the end
         output_line = paste( y[i], " |image ", iv_pairs, "\n", sep = "" )
         # write to file
         cat( output_line, file = f )
         # print progress
         if ( i %% 1000 == 0 ) {
            print( i )
            }
      }
# Close the connection
close( f )
}
# Invoke the function to convert
write_vwformat("train.vw",y,x)

Once the ‘train.vw’ file is available, modelling can begin. We have used the one-against-all (oaa) approach. You may refer here for an example of its usage. The model-building command (first line) and its output (remaining lines) are below:

$ vw -d train.vw --cache_file mnist_cache  --passes 25  --oaa 10  -f mnist.model -b20  -q ii
 
creating quadratic features for pairs: ii 
final_regressor = mnist.model
Num weight bits = 20
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
can't open: mnist_cache, error = No such file or directory
creating cache_file = mnist_cache
Reading datafile = train.vw
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000000   0.000000          1      1.0          1        1     9507
0.500000   1.000000          2      2.0         10        1    60271
0.750000   1.000000          4      4.0          4        1    12883
0.875000   1.000000          8      8.0          3        7    31863
0.812500   0.750000         16     16.0          2        1    44733
0.812500   0.812500         32     32.0          2        3    22953
0.671875   0.531250         64     64.0          3        3    28731
0.523438   0.375000        128    128.0         10       10    62251
0.417969   0.312500        256    256.0          9        9    16003
0.289062   0.160156        512    512.0          2        2    14521
0.221680   0.154297       1024   1024.0          8        8    37057
0.162598   0.103516       2048   2048.0          5        5    23871
0.119873   0.077148       4096   4096.0          4        4    13573
0.094482   0.069092       8192   8192.0          4        4     9313
0.072754   0.051025      16384  16384.0          8        8    19741
0.056641   0.040527      32768  32768.0          1        1     6481
0.044087   0.044087      65536  65536.0          6        6    19741 h
0.034952   0.025817     131072 131072.0          5        5     9121 h
0.030728   0.026504     262144 262144.0          9        9    26733 h
0.027157   0.023586     524288 524288.0          2        2    35533 h

finished run
number of examples per pass = 37800
passes used = 15
weighted example sum = 567000
weighted label sum = 0
average loss = 0.0230952 h
total feature number = 13850936880
  

An explanation of the arguments to vw is in order. Flag --cache_file mnist_cache first converts train.vw to a binary file for faster processing; the next time we build the model, this cache file rather than train.vw will be read (provided the arguments to vw remain by and large the same). Argument --passes 25 is the number of passes, and --oaa 10 selects the one-against-all learning algorithm with 10 classes (1 to 10). -q ii creates interaction features between variables in the two named namespaces, which here are both the same ‘image’ namespace. An interaction variable is created from two variables ‘A’ and ‘B’ by multiplying the values of ‘A’ and ‘B’; it is a general technique and you may read more about it in Wikipedia (see the sketch below). -f mnist.model is the file where the model will be saved. -b 20 is the number of bits in the feature table; the default is 18, but as we have greatly increased the number of features by introducing interaction features, the value of ‘-b’ has been raised to 20. You may read more about VW command-line arguments here. Model construction can be done on an 8GB machine and does not consume much time or RAM.
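To make the effect of -q ii more concrete, here is a small illustrative Python sketch (not VW's actual code) of how quadratic interaction features are formed: every pair of features drawn from the two namespaces yields a new feature whose value is the product of the two original values. The feature names and values below are made up for illustration only.

# Illustration only: how quadratic (interaction) features multiply the feature count.
# With -q ii both namespaces are the same 'image' namespace, so pairs are formed
# within the same set of pixel features.
from itertools import product

pixels = {"pixel1": 0.0, "pixel2": 0.72, "pixel3": 0.99}   # toy values

quadratic = {a + "^" + b: va * vb
             for (a, va), (b, vb) in product(pixels.items(), pixels.items())}

print(len(quadratic))               # 3 x 3 = 9 interaction features from 3 originals
print(quadratic["pixel2^pixel3"])   # 0.72 * 0.99

This blow-up in the number of features is why the ‘current features’ column in the vw output above runs into the tens of thousands even though each image has only 784 pixels, and why ‘-b’ is raised above its default of 18.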

Once the model is ready, we need to convert ‘test.csv’ to ‘test.vw’ with an arbitrary label value. The code for the conversion is below.

# For comments read earlier R code
# ============================

# Read test data and scale it
data<-read.csv("test.csv",header=TRUE)
x<-(data[,])/255

# Function to convert the test csv file to VW format
#  (no label column here; a dummy label of 1 is written for every row)
write_vwformat = function( filename, x ) {

	# open file where to write output to
	f = file( filename, 'w' )

	# loop over all rows
	for ( i in 1:nrow( x )) {
		indexes = 1:ncol(x)
		values = x[i, indexes]
		prefix = paste("pixel", indexes, sep="")
		iv_pairs = paste( prefix, values, sep = ":", collapse = " " )
		output_line = paste( "1 |image ", iv_pairs, "\n", sep = "" )
		cat( output_line, file = f )
		if ( i %% 1000 == 0 ) {
			print( i )
		}
	}
	close( f )
}
write_vwformat("test.vw",x)
  

Let us now make the prediction for each image listed in test.vw. The vw prediction command is as below.


$ vw -t -i mnist.model test.vw -p predict.txt
   

Flag -t tells vw this is a test run; -i specifies the model file created earlier, and the class predictions (1 to 10) are saved to the file ‘predict.txt’. Kaggle requires submissions in a particular format; the R code for this is below. While preparing the submission, we reconvert label 10 back to digit 0.

data<-read.csv("predict.txt",header=FALSE)
# Reconvert 10 to 0
data[data[]==10]<-0
# Create a dataframe with one column
#  of IDs 1 to 28000
cpy<-as.data.frame(1:28000)
# Create a second column of predicted digits 
cpy[2]<-data
# Assign column names
colnames(cpy)<-c("ImageId","Label")
# Write to file for submission
write.csv(cpy,file="submit.csv",row.names=F,quote=F)
  

File ‘submit.csv’ can be submitted to Kaggle. The score is 0.97943.

Kaggle Accuracy Score

Performance Evaluation

Performance can be evaluated properly only against a test dataset for which the true digits are known. So from our ‘train.csv’ we sample 10% of the records (without replacement). There are many ways to do this; for example, the sample() function in R can draw records from ‘train.csv’ and write them to a file. We have used a bash script to extract 4200 lines from ‘train.csv’ without replacement (a Python alternative is sketched after the script). The bash script is below; it does take time, but is in the spirit of Vowpal Wabbit in that the complete file is not read into RAM beforehand:

#!/bin/bash
#
# Generate a cross-validation file of the specified number of lines (without replacement)
# Creates two files: training and tovalidate.
# Original file remains undisturbed.
# Usage: ./cross_validate.sh (no arguments: Specify two below)

####User to fill in following two constants##########
datadir="Documents/kaggle"
originalfile="train.csv"
# Get home folder
cd ~
homefolder=`pwd`
# your data folder
datadir=$homefolder/$datadir
echo "Your data directory is $datadir"
echo "Your original data file is $originalfile"
echo -n "Press Enter to continue or ctrl+c to terminate"; read x
#########Begin#############
trainfile="training"
samplefile="tovalidate"

cd $datadir
# Delete sample file, if it exists, & recreate it
echo "Deleting $samplefile file"
rm -f $datadir/$samplefile
touch $datadir/$samplefile

# Delete training file,if it exists, and recreate copy of originalfile
echo "Deleting $trainfile file"
rm -f $datadir/$trainfile

echo "Creating $trainfile file"
cp $datadir/$originalfile $datadir/$trainfile
echo "Created file $trainfile"

# Delete temp file, if it exists
rm -f temp.txt

# Number of lines in given file
nooflines=`sed -n '$=' $datadir/$trainfile`

echo "No of lines in $datadir/$originalfile  are: $nooflines"
echo -n "Your sample size (recommended 10% of orig file)? " ; read samplesize

# Default is 100
if [ -z $samplesize ] ; then
	echo "Default value of sample size = 100"
	samplesize=100
fi

# Bash loop to generate random numbers and test file
echo "Will generate $samplesize random numbers"
echo "Original file size is $nooflines lines"
echo "Wait...it takes time...."
for (( i = 1 ; i <= $samplesize ; ++i ));
	do 
		# Note: $RANDOM alone tops out at 32767; combine two draws so that
		#  every data line (2 .. nooflines) can be selected
		arr[i]=$(( ( (RANDOM * 32768 + RANDOM) % ($nooflines - 1) )  + 2 ));
		lineno="${arr[i]}"
		# Append lines to sample file
		sed -n "${lineno}p" $datadir/$trainfile >> $datadir/$samplefile
		# Delete the same line from $trainfile
		sed "${lineno}d" $datadir/$trainfile > temp.txt
		mv temp.txt $datadir/$trainfile
  	        # Print number of lines appended in multiples of 50
		a=$(( ( $i % 50 )  + 2 ));
		if [ $a == "2" ] ; then
			echo "Lines appended: $i"; 
		fi
	done

trlines=`sed -n '$=' $datadir/$trainfile`
samplines=`sed -n '$=' $datadir/$samplefile`

# Delete temp file
rm -f temp.txt

echo "---------------------------------"	
echo "Lines in sample file $samplefile: $samplines"
echo "Lines in training file $trainfile : $trlines" ;
echo "Data folder: $datadir"
echo "---------------------------------"
##############END######################
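For comparison, below is a minimal Python sketch of the same split (roughly 10% of the rows held out, without replacement); like the bash script, it streams the file line by line rather than loading it into RAM. The output file names follow the bash script above; treat this as a sketch, not a drop-in replacement.

# Split train.csv into 'training' and 'tovalidate' (about 10% of data rows held out)
import random

random.seed(1)
with open("train.csv") as src, \
     open("training", "w") as train_out, \
     open("tovalidate", "w") as valid_out:
    header = next(src)
    train_out.write(header)          # keep the header with the training part
    for line in src:
        # each data row goes to the validation file with probability 0.10
        (valid_out if random.random() < 0.10 else train_out).write(line)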
  

We convert the files ‘training’ and ‘tovalidate’ produced above to VW format (‘training.vw’ and ‘tovalidate.vw’) as before. We then run the following simple vw commands, first to create the model and then to make predictions.

# Create model
vw -d training.vw --cache_file t_cache  --passes 25  --oaa 10  -f t_model
# Make predictions
vw -t -i t_model tovalidate.vw -p p.txt
  

We now have the predicted digits (in file ‘p.txt’) and the actual digits (first column of ‘tovalidate.vw’). We compare the two and prepare i) a confusion matrix and ii) a classification report. Unfortunately, unlike for two-class predictions, not many tools are available for multiclass accuracy reports. We will use the sklearn.metrics (python) module. The code is very simple: just read the data and print the reports.

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import pandas as pd

# Read tovalidate.vw (observed labels) and p.txt (predicted labels)
obs=pd.read_table("/home/ashokharnal/Documents/kaggle/tovalidate.vw",sep=" ",header=None)
pred=pd.read_csv("/home/ashokharnal/Documents/kaggle/p.txt",sep=" ",header=None)

# Compute the confusion matrix (first column of each frame holds the label)
cm = confusion_matrix(obs.iloc[:,0], pred.iloc[:,0])
cm

array([[486,   2,   2,   1,   0,   1,   1,   4,   0,   0],
       [  5, 375,  12,   6,   2,   3,  13,   9,   3,   1],
       [  2,   8, 383,   0,  17,   2,   1,   7,   4,   0],
       [  1,   1,   1, 405,   1,   3,   0,   3,  13,   1],
       [  1,   3,  23,   8, 297,   8,   2,  13,   6,   8],
       [  1,   3,   0,   5,   9, 366,   0,   2,   0,   3],
       [  4,   4,   5,  11,   4,   0, 419,   1,  17,   3],
       [  3,   6,  21,   1,   9,   1,   5, 343,   7,   2],
       [  0,   1,  14,  12,   0,   0,  11,   5, 389,   0],
       [  0,   0,   1,   0,   3,   1,   0,   1,   1, 357]])

target_names = ['class a', 'class b', 'class c', 'class d','class e','class f','class g','class h','class j','class k']
# Print classification report
print(classification_report(obs.iloc[:,0], pred.iloc[:,0], target_names=target_names))

             precision    recall  f1-score   support

    class a       0.97      0.98      0.97       497
    class b       0.93      0.87      0.90       429
    class c       0.83      0.90      0.86       424
    class d       0.90      0.94      0.92       429
    class e       0.87      0.80      0.84       369
    class f       0.95      0.94      0.95       389
    class g       0.93      0.90      0.91       468
    class h       0.88      0.86      0.87       398
    class j       0.88      0.90      0.89       432
    class k       0.95      0.98      0.97       364

avg / total       0.91      0.91      0.91      4199

# The following additional code displays the confusion matrix visually (see image below)

%pylab
import matplotlib.pyplot as plt
# Show confusion matrix in a separate window
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
  
Confusion Matrix

This finishes our experiment on Handwritten Digit Recognition.

Click Through Rate prediction using neural network in Vowpal Wabbit: A to Z

March 3, 2015

Click through rate (CTR) is an accepted metric for judging the success of an online advertising campaign. As huge sums of money are spent on online advertisements, advertisers want to learn which advertisements are likely to be successful and which are not. A number of machine learning techniques are used in the process.

Vowpal Wabbit is a very fast machine learning system. It bundles a number of machine learning algorithms with very high predictive accuracy. In many Kaggle competitions where the data is complex or sizeable or the number of features is large, Vowpal Wabbit is used by some of the participants. However, unlike R or python machine learning environments, Vowpal Wabbit does not, as yet, have any data exploration capability, though a utility wrapper does exist to give one some idea of the data. One must already know one’s data before applying a machine learning algorithm with Vowpal Wabbit. Kaggle recently hosted a competition for predicting the click-through rate of online advertisements. The competition is on behalf of Avazu, who placed 11 days of its click-through data on the site: 10 days of data constitute ‘train.csv’ and one day’s data is in ‘test.csv’. My score after going through model building was 0.3976030. Plenty of scope for model improvement exists.

The file ‘train.csv’ is around 5.9 GB and ‘test.csv’ around 674 MB. Fields in ‘train.csv’ are as below:


All variables except 'hour' are categorical variables

id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH
C1: anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21--anonymized categorical variables

test.csv has all the fields except ‘click’, which is not disclosed to us. The job is to train a machine learning model on ‘train.csv’ so as to predict ‘click’ for each online advertisement listed in ‘test.csv’. We can count the lines in both files and look at the first few lines as follows:


# count lines in train.csv and test.csv
$ wc -l train.csv
40428968 train.csv
$ wc -l test.csv
4577465 test.csv
# show first five lines in train.csv and in test.csv
$ head -5 train.csv
id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,ddd2926e,44956a24,1,2,15706,320,50,1722,0,35,-1,79
10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,96809ac8,711ee120,1,0,15704,320,50,1722,0,35,100084,79
10000371904215119486,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,b3cf8def,8a4875bd,1,0,15704,320,50,1722,0,35,100084,79
10000640724480838376,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8275b8f,6332421a,1,0,15706,320,50,1722,0,35,100084,79
$ head -5 test.csv
id,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
10000174058809263569,14103100,1005,0,235ba823,f6ebf28e,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,69f45779,0eb711ec,1,0,8330,320,50,761,3,175,100075,23
10000182526920855428,14103100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8d44657,ecb851b2,1,0,22676,320,50,2616,0,35,100083,51
10000554139829213984,14103100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,10fb085b,1f0bc64f,1,0,22676,320,50,2616,0,35,100083,51
10001094637809798845,14103100,1005,0,85f751fd,c4e18dd6,50e219e0,51cedd4e,aefc06bd,0f2161f8,a99f214a,422d257a,542422a7,1,0,18648,320,50,1092,3,809,100156,61

To observe the structure of the training file we read the data in R. Reading the data in R implies loading the whole 5.9 GB file into RAM. The operating system, however, reports (check with the command: $ cat /proc/meminfo) that the RAM actually occupied is much more than 5.9 GB. With 8 GB of total RAM on my machine it is not possible to do any modelling in R (in fact even 16 GB becomes insufficient). Hence, Vowpal Wabbit. Installation instructions for CentOS are here and general instructions here. After reading the data in R, observe its structure. It is as below:


# We read the 'hour' field as numeric and most of the rest as 'factor'
>data<-read.csv("train.csv",header=TRUE,colClasses=c('character','factor','numeric',rep('factor',21)))
>str(data)
'data.frame':    40428967 obs. of  24 variables:
$ id              : chr  "1000009418151094273" "10000169349117863715" "10000371904215119486"  "10000640724480838376" ...
$ click           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
$ hour            : num  14102100 14102100 14102100 14102100 14102100 ...
$ C1              : Factor w/ 7 levels "1001","1002",..: 3 3 3 3 3 3 3 3 3 2 ...
$ banner_pos      : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
$ site_id         : Factor w/ 4737 levels "000aa1a4","00255fb4",..: 583 583 583 583 4696 3949 2679 4159 583 2475 ...
$ site_domain     : Factor w/ 7745 levels "000129ff","0035f25a",..: 7340 7340 7340 7340 4457 5718 1194 3894 7340 6001 ...
$ site_category   : Factor w/ 26 levels "0569f928","110ab22d",..: 3 3 3 3 1 25 25 25 3 7 ...
$ app_id          : Factor w/ 8552 levels "000d6291","000f21f1",..: 7885 7885 7885 7885 7885 7885 7885 7885 7885 7885 ...
$ app_domain      : Factor w/ 559 levels "001b87ae","002e4064",..: 255 255 255 255 255 255 255 255 255 255 ...
$ app_category    : Factor w/ 36 levels "07d7df22","09481d60",..: 1 1 1 1 1 1 1 1 1 1 ...
$ device_id       : Factor w/ 2686408 levels "00000414","00000715",..: 1780273 1780273 1780273 1780273 1780273 1780273 1780273 1780273 1780273 2049924 ...
$ device_ip       : Factor w/ 6729486 levels "0000016d","00000262",..: 5831891 3958536 4728526 6103614 3952384 134645 4690993 6072203 1470150 6353522 ...
$ device_model    : Factor w/ 8251 levels "00097428","0009f4d7",..: 2188 3613 4461 3164 3836 4461 6079 6080 2997 1751 ...
$ device_type     : Factor w/ 5 levels "0","1","2","4",..: 2 2 2 2 2 2 2 2 2 1 ...
$ device_conn_type: Factor w/ 4 levels "0","2","3","5": 2 1 1 1 1 1 1 1 2 1 ...
$ C14             : Factor w/ 2626 levels "10289","1037",..: 188 186 186 188 493 239 642 680 189 920 ...
$ C15             : Factor w/ 8 levels "1024","120","216",..: 5 5 5 5 5 5 5 5 5 5 ...
$ C16             : Factor w/ 9 levels "1024","20","250",..: 7 7 7 7 7 7 7 7 7 7 ...
$ C17             : Factor w/ 435 levels "1008","1042",..: 31 31 31 31 84 51 126 136 31 187 ...
$ C18             : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 4 1 4 ...
$ C19             : Factor w/ 68 levels "1059","1063",..: 33 33 33 33 33 43 35 35 33 15 ...
$ C20             : Factor w/ 172 levels "-1","100000",..: 1 66 66 66 1 60 1 1 1 146 ...
$ C21             : Factor w/ 60 levels "1","100","101",..: 53 53 53 53 16 11 16 33 53 33 ...
>
>table(data$click)

0        1
33563901  6865066

Note from the output of the ‘table’ command above that clicks are about 20% of non-clicks (6865066 / 33563901 ≈ 0.20). We have also taken a summary of the data attributes in R. It is as below:

> summary(data)
id                 click             hour             C1
Length:40428967    0:33563901   Min.   :14102100   1001:    9463
Class :character   1: 6865066   1st Qu.:14102304   1002: 2220812
Mode  :character                Median :14102602   1005:37140632
Mean   :14102558   1007:   35304
3rd Qu.:14102814   1008:    5787
Max.   :14103023   1010:  903457
1012:  113512

banner_pos       site_id           site_domain        site_category
0:29109590   85f751fd:14596137   c4e18dd6:15131739   50e219e0:16537234
1:11247282   1fbe01fe: 6486150   f3845767: 6486150   f028772b:12657073
2:   13001   e151e245: 2637747   7e091613: 3325008   28905ebd: 7377208
3:    2035   d9750ee7:  963745   7687a86e: 1290165   3e814130: 3050306
4:    7704   5b08c53b:  913325   98572c79:  996816   f66779e6:  252451
5:    5778   5b4d2eda:  771360   16a36ef3:  855686   75fa27f6:  160985
7:   43577   (Other) :14060503   (Other) :12343403   (Other) :  393710

app_id               app_domain         app_category
ecad2386:25832830   7801e8d9:27237087   07d7df22:26165592
92f5800b: 1555283   2347f47a: 5240885   0f2161f8: 9561058
e2fcccd2: 1129016   ae637522: 1881838   cef3e649: 1731545
febd1138:  759098   5c5a694b: 1129228   8ded1f7a: 1467257
9c13b419:  757812   82e27996:  759125   f95efa07: 1141673
7358e05e:  615635   d9b5648e:  713924   d1327cf5:  123233
(Other) : 9779293   (Other) : 3466880   (Other) :  238609

device_id           device_ip          device_model      device_type
a99f214a:33358308   6b9769f2:  208701   8a4875bd: 2455470   0: 2220812
0f7c61dc:   21356   431b3174:  135322   1f0bc64f: 1424546   1:37304667
c357dbff:   19667   2f323f36:   88499   d787e91b: 1405169   2:      31
936e92fb:   13712   af9205f9:   87844   76dc4769:  767961   4:  774272
afeffc18:    9654   930ec31d:   86996   be6db1d7:  742913   5:  129185
987552d1:    4187   af62faf4:   85802   a0f5f879:  652751
(Other) : 7002083   (Other) :39735803   (Other) :32980157

device_conn_type      C14                C15                C16
0:34886838       4687   :  948215   320    :37708959   50     :38136554
2: 3317443       21611  :  907004   300    : 2337294   250    : 1806334
3: 2181796       21189  :  765968   216    :  298794   36     :  298794
5:   42890       21191  :  765092   728    :   74533   480    :  103365
19771  :  730238   120    :    3069   90     :   74533
19772  :  729305   1024   :    2560   20     :    3069
(Other):35583145   (Other):    3758   (Other):    6318

C17           C18               C19                C20
1722   : 4513492   0:16939044   35     :12170630   -1     :18937918
2424   : 1531071   1: 2719623   39     : 8829426   100084 : 2438478
2227   : 1473105   2: 7116058   167    : 3145695   100148 : 1794890
1800   : 1190161   3:13654242   161    : 1587765   100111 : 1716733
423    :  948215                47     : 1451708   100077 : 1575495
2480   :  918663                1327   : 1092601   100075 : 1546414
(Other):29854260                (Other):12151142   (Other):12419039

C21
23     : 8896205
221    : 5051245
79     : 4614799
48     : 2160794
71     : 2108496
61     : 2053636
(Other):15543792

From the data structure and its summary it can be seen that certain categorical variables such as device_id and device_ip have far too many levels to be really meaningful as factors. Some others, such as device_model and site_id, also have a relatively large number of levels. In this blog I have not taken this into account, but in a closer analysis it may be worthwhile to examine whether some levels can be clubbed together, or whether such a feature should be ignored altogether as being merely another id. A rough sketch of such a check is below.
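As a rough illustration of such a check (not something used in the model below), the pandas sketch here counts the levels of one categorical column and clubs all rare levels into a single ‘other’ bucket; the column chosen and the threshold of 1000 are arbitrary.

# Count levels of a categorical column and club rare levels into 'other'
import pandas as pd

data = pd.read_csv("train.csv", usecols=["device_model"], dtype=str)
counts = data["device_model"].value_counts()
print(len(counts))                        # number of distinct levels

rare = counts[counts < 1000].index        # levels seen fewer than 1000 times
data["device_model_clubbed"] = data["device_model"].where(
        ~data["device_model"].isin(rare), "other")
print(data["device_model_clubbed"].nunique())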

Vowpal Wabbit requires input data to be in certain format. It is not csv format. Its formatting instructions are here. You may further benefit from the detailed explanations regarding input format at this Stack Overflow link. You may also like to go through the clarifications regarding difference between ‘namespace’ and ‘feature’ at this link.

We will format the input as below. While formatting, the ‘id’ field is ignored as being of no importance. Note also that the value of ‘click’ in train.csv is either 0 or 1.


==> train.csv <==
id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,ddd2926e,44956a24,1,2,15706,320,50,1722,0,35,-1,79
10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,96809ac8,711ee120,1,0,15704,320,50,1722,0,35,100084,79
10000371904215119486,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,b3cf8def,8a4875bd,1,0,15704,320,50,1722,0,35,100084,79
10000640724480838376,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8275b8f,6332421a,1,0,15706,320,50,1722,0,35,100084,79
10000679056417042096,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,07d7df22,a99f214a,9644d0bf,779d90c2,1,0,18993,320,50,2161,0,35,-1,157
10000720757801103869,0,14102100,1005,0,d6137915,bb1ef334,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,05241af0,8a4875bd,1,0,16920,320,50,1899,0,431,100077,117
10000724729988544911,0,14102100,1005,0,8fda644b,25d4cfcd,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,b264c159,be6db1d7,1,0,20362,320,50,2333,0,39,-1,157
10000918755742328737,0,14102100,1005,1,e151e245,7e091613,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,e6f67278,be74e6fe,1,0,20632,320,50,2374,3,39,-1,23
10000949271186029916,1,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,37e8da74,5db079b5,1,2,15707,320,50,1722,0,35,-1,79

==> train.vw <==
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_ddd2926e device_model_44956a24 device_type_1 device_conn_type_2 |others c14_15706 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_-1 c21_79
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_96809ac8 device_model_711ee120 device_type_1 device_conn_type_0 |others c14_15704 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_100084 c21_79
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_b3cf8def device_model_8a4875bd device_type_1 device_conn_type_0 |others c14_15704 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_100084 c21_79
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_e8275b8f device_model_6332421a device_type_1 device_conn_type_0 |others c14_15706 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_100084 c21_79
-1 |fe c1_1005 banner_pos_1 |site site_id_fe8cc448 site_domain_9166c161 site_category_0569f928 |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_9644d0bf device_model_779d90c2 device_type_1 device_conn_type_0 |others c14_18993 c15_320 c16_50 c17_2161 c18_0 c19_35 c20_-1 c21_157
-1 |fe c1_1005 banner_pos_0 |site site_id_d6137915 site_domain_bb1ef334 site_category_f028772b |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_05241af0 device_model_8a4875bd device_type_1 device_conn_type_0 |others c14_16920 c15_320 c16_50 c17_1899 c18_0 c19_431 c20_100077 c21_117
-1 |fe c1_1005 banner_pos_0 |site site_id_8fda644b site_domain_25d4cfcd site_category_f028772b |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_b264c159 device_model_be6db1d7 device_type_1 device_conn_type_0 |others c14_20362 c15_320 c16_50 c17_2333 c18_0 c19_39 c20_-1 c21_157
-1 |fe c1_1005 banner_pos_1 |site site_id_e151e245 site_domain_7e091613 site_category_f028772b |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_e6f67278 device_model_be74e6fe device_type_1 device_conn_type_0 |others c14_20632 c15_320 c16_50 c17_2374 c18_3 c19_39 c20_-1 c21_23
1 2 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_37e8da74 device_model_5db079b5 device_type_1 device_conn_type_2 |others c14_15707 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_-1 c21_79
-1 |fe c1_1002 banner_pos_0 |site site_id_84c7ba46 site_domain_c4e18dd6 site_category_50e219e0 |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_c357dbff device_ip_f1ac7184 device_model_373ecbe6 device_type_0 device_conn_type_0 |others c14_21689 c15_320 c16_50 c17_2496 c18_3 c19_167 c20_100191 c21_23

What we have done above is this: we created five namespaces: fe, site, app, device and others. The first two feature fields (C1 and banner_pos) have been bracketed under the ‘fe’ namespace (‘fe’ is an arbitrary name), the site-related fields under the ‘site’ namespace, and so on. Fields about which we are not clear (their names being anonymized) are under the ‘others’ namespace.


Namespace	Fields (prefix to value) in namespace

|fe             c1_             banner_pos_
|site           site_id_        site_domain_    site_category_
|app            app_id_         app_domain_     app_category_
|device         device_id_      device_ip_      device_model_   device_type_   device_conn_type_
|others         c14_            c15_            c16_	c17_    c18_     c19_  c20_  c21_

A namespace name starts with ‘|’. A namespace is known by its initial letter rather than by its complete name; thus the identifier for the ‘site’ namespace is ‘s’ rather than ‘site’. Before the first namespace (‘|fe’) comes the class (i.e. ‘click’) label: it is -1 if the click is 0, and remains 1 if the click is 1. As the number of clicks (1s) is small, we have attached an ‘Importance’ weight of 2 to each click (the train.vw line above whose label reads ‘1 2’). Later on in our analyses, we will vary this ‘Importance’ and see the effect.

Conversion from csv to Vowpal Wabbit format is easy and can be carried out using either ‘awk’ or python. The awk code is below. The header line is skipped (NR > 1), as is the first field (i.e. id, or $1):

#! /bin/awk -f
# # Call it as: ./train.awk train.csv > train.vw
# # Check few sample lines in VW Validator: http://hunch.net/~vw/validate.html

BEGIN {FS = "," ; ORS = ""};
		{if (NR > 1 )
			{
	 	 	if ($2 == 0) $2 = "-1" ; else  $2 = "1 2" ;
 			print($2)
 	 		print(" |fe")
			print(" c1_") ; print $4
			print(" banner_pos_") ; print $5
			print(" |site")
			print(" site_id_") ; print $6
			print(" site_domain_") ; print $7 ;
			print(" site_category_") ; print $8 ;
			print(" |app")
			print(" app_id_"); print $9
			print(" app_domain_") ; print $10
			print(" app_category_") ; print($11)
			print(" |device")
			print(" device_id_") ; print($12)
			print(" device_ip_"); print($13)
			print(" device_model_") ; print($14)
			print(" device_type_") ; print($15)
			print(" device_conn_type_") ; print($16)
			print(" |others")
			print(" c14_") ; print($17)
			print(" c15_") ; print($18)
			print(" c16_") ; print($19)
			print(" c17_") ; print($20)
			print(" c18_") ; print($21)
			print(" c19_") ; print($22)
			print(" c20_") ; print($23)
			print(" c21_") ; print($24)
 			print("\n")
 			} }

You can print the first few lines of the ‘train.vw’ file with the command: $ head --lines 5 train.vw . The python conversion code is equally simple and is given below:

## Use it as: python convert.py > train.vw
import csv

trainfile = open("train.csv", "r")
csv_reader = csv.reader(trainfile)
linenum = 0
for row in csv_reader:
	linenum +=1
	# If not header
	if linenum > 1:
		vw_line = ""
		# Check value in column 2. If 0, make it -1
		if (str(row[1])== "0" ):
			# Label in vw_line
			vw_line += "-1 |fe"
		else:
			vw_line += "1 2 |fe"
		dtime_numb=row[2]
		year  = dtime_numb[0:2]
		month = dtime_numb[2:4]
		day   = dtime_numb[4:6]
		hour  = dtime_numb[6:9]
		yeartime = " year:"+year + " month:" + month + " day:" + day +" hour:" + hour
		vw_line += yeartime
		vw_line += " |pos"
		vw_line += str(" c1_")+str(row[3])
		vw_line += str(" banner_pos_")+str(row[4])
		vw_line += " |site"
		vw_line += str(" site_id_")+str(row[5])
		vw_line += str(" site_domain_")+str(row[6])
		vw_line += str(" site_category_")+str(row[7])
		vw_line += " |app"
		vw_line += str(" app_id_")+str(row[8])
		vw_line += str(" app_domain_")+str(row[9])
		vw_line += str(" app_category_")+str(row[10])
		vw_line += " |device"
		vw_line += str(" device_id_")+str(row[11])
		vw_line += str(" device_ip_")+str(row[12])
		vw_line += str(" device_model_")+str(row[13])
		vw_line += str(" device_type_")+str(row[14])
		vw_line += str(" device_conn_type_")+str(row[15])
		vw_line += " |others"
		vw_line += str(" c14_")+str(row[16])
		vw_line += str(" c15_")+str(row[17])
		vw_line += str(" c16_")+str(row[18])
		vw_line += str(" c17_")+str(row[19])
		vw_line += str(" c18_")+str(row[20])
		vw_line += str(" c19_")+str(row[21])
		vw_line += str(" c20_")+str(row[22])
		vw_line += str(" c21_")+str(row[23])
		print (vw_line)

You may have noted that in the python code I have also included the ‘hour’ field, breaking it into four pieces. However, we will not use the ‘hour’ field in our learning machine. It is also a good idea to check beforehand whether the input format meets vw’s requirements; you can do this by pasting a few lines into the vw validator here. The size of the train.vw file is around 12.5 GB, i.e. more than double that of ‘train.csv’.

Now that our vw file is ready we can feed it to the VW machine. The command (just the first line) and its output are as below:

$ vw -d train.vw --cache_file  neural --inpass --passes 5 -q sd  -q ad  -q do -q fd --binary -f neural_model  --loss_function=logistic  --nn 3

creating quadratic features for pairs: sd ad do fd
final_regressor = neural_model
using input passthrough for neural network training
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = neural
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000000   0.000000          1      1.0    -1.0000  -1.0000      102
0.000000   0.000000          2      2.0    -1.0000  -1.0000      102
0.000000   0.000000          4      4.0    -1.0000  -1.0000      102
0.000000   0.000000          8      8.0    -1.0000  -1.0000      102
0.062500   0.125000         16     16.0    -1.0000  -1.0000      102
0.125000   0.187500         32     32.0    -1.0000  -1.0000      102
0.234375   0.343750         64     64.0    -1.0000  -1.0000      102
0.234375   0.234375        128    128.0    -1.0000  -1.0000      102
0.179688   0.125000        256    256.0    -1.0000  -1.0000      102
0.179688   0.179688        512    512.0    -1.0000  -1.0000      102
0.177734   0.175781       1024   1024.0    -1.0000  -1.0000      102
0.176270   0.174805       2048   2048.0    -1.0000  -1.0000      102
0.183594   0.190918       4096   4096.0    -1.0000  -1.0000      102
0.180298   0.177002       8192   8192.0    -1.0000  -1.0000      102
0.174622   0.168945      16384  16384.0    -1.0000  -1.0000      102
0.177582   0.180542      32768  32768.0    -1.0000  -1.0000      102
0.175446   0.173309      65536  65536.0    -1.0000  -1.0000      102
0.173775   0.172104     131072 131072.0    -1.0000  -1.0000      102
0.170753   0.167732     262144 262144.0    -1.0000  -1.0000      102
0.163738   0.156723     524288 524288.0    -1.0000  -1.0000      102
0.157854   0.151970    1048576 1048576.0    -1.0000  -1.0000      102
0.161858   0.165862    2097152 2097152.0    -1.0000  -1.0000      102
0.172266   0.182673    4194304 4194304.0     1.0000  -1.0000      102
0.161030   0.149794    8388608 8388608.0    -1.0000  -1.0000      102
0.168042   0.175053   16777216 16777216.0    -1.0000  -1.0000      102
0.164841   0.161640   33554432 33554432.0    -1.0000  -1.0000      102
0.165551   0.165551   67108864 67108864.0    -1.0000  -1.0000      102 h
0.165849   0.166147   134217728 134217728.0    -1.0000  -1.0000      102 h

finished run
number of examples per pass = 36386071
passes used = 5
weighted example sum = 1.8193e+08
weighted label sum = -1.20141e+08
average loss = 0.164397 h
best constant = -1.58694
best constant's loss = 0.455592
total feature number = 18556896210
[ashokharnal@master clickthroughrate]$

An explanation of the arguments to the ‘vw‘ command is here: While processing the text file, train.vw, vw first converts it into a special binary format (cache file); this file is ‘neural’. The next time you run the command, vw will use the cache file rather than the vw file. The contents of the cache file are argument dependent; if you run vw with different arguments, a new cache file may be created. The number of '--passes' is 5. '--binary' is for binary classification. The model will be stored in the file given by '-f neural_model'. The loss function for convergence is '--loss_function=logistic'. Why did I select the ‘logistic’ loss function? The default loss function is ‘squared’, and in many instances the squared loss function leads to slower learning. An excellent and simple explanation of loss functions appropriate for neural networks is given by Michael Nielsen in his html book here. '--nn 3' specifies a neural network with one hidden layer having 3 neurons. '--inpass' adds a further direct connection between the input and output layers. The arguments '-q sd -q ad -q do -q fd' create interaction variables. An interaction variable is created from two variables ‘A’ and ‘B’ by multiplying the values of ‘A’ and ‘B’. It is a general technique and you may read more about it on Wikipedia. The argument '-q sd' means that all possible interaction variables are created from variables in the namespace ‘s’ (i.e. site) and the namespace ‘d’ (i.e. device). A similar explanation holds for the three other interactions ‘-q ad -q do -q fd’.
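To see what these '-q' interactions do conceptually, here is a toy Python sketch (not vw's internal code; the feature names are made up) that crosses every feature of a 'site' namespace with every feature of a 'device' namespace:

# A toy illustration (not vw internals): '-q sd' pairs every feature of namespace 's'
# with every feature of namespace 'd'. Feature names below are hypothetical.
site_features = ["site_id_abc123", "site_domain_xyz", "site_category_news"]
device_features = ["device_model_m1", "device_conn_type_0"]

quadratic = [s + "^" + d for s in site_features for d in device_features]
print(len(quadratic))   # 3 x 2 = 6 interaction features
print(quadratic[0])     # site_id_abc123^device_model_m1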

In our model building we have not used regularization. Regularization helps to avoid over-fitting. For example, a non-linear model may become so flexible that its decision boundary bends to pass through every point of a class (including noise). A boundary with so many twists and turns generalizes poorly when the model is asked to classify a new, unseen point. Model building, therefore, attempts to penalize excessive twists and turns, hence ‘regularization’. For further exploration, it is worthwhile trying L1 (--l1) and L2 (--l2) regularization; say, to start with: --l1 0.0005 and --l2 0.00004. About regularization in neural networks you may like to read this work here.
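Conceptually, regularization adds a penalty on the weight magnitudes to the loss being minimized, with --l1 and --l2 playing the role of the two penalty coefficients. A minimal numpy sketch of the idea (vw actually applies these penalties inside its online updates, so this is only an illustration, with made-up weights):

import numpy as np

def penalized_loss(base_loss, w, l1=0.0005, l2=0.00004):
    """Average loss plus L1 and L2 weight penalties (conceptual sketch only)."""
    return base_loss + l1 * np.sum(np.abs(w)) + l2 * np.sum(w ** 2)

w = np.array([0.7, -1.2, 0.05])      # some hypothetical model weights
print(penalized_loss(0.164, w))      # the penalties push the objective slightly upward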

The average loss of 0.164397 is an indicator of model fit (see the 'average loss' line in the output above). A direct and relative comparison of the performance of various models can be made with this measure rather than by uploading the predictions to Kaggle.

The above run takes around 2 hours on an 8GB machine and occupies at most 1.5GB of RAM. Thus it is both fast and economical on resources.

It is now time to predict clicks for test.csv. This file first needs to be converted to vw format. We add a click field to it, with a uniform click value of 1 in all records; this field is ignored while making predictions. The awk conversion code is as below. The first field of ‘test.csv’ is the ‘id’ field and is ignored.

#! /bin/awk -f
# Call it as: ./test.awk test.csv > test.vw

BEGIN { FS = "," ; ORS = ""}
		{if (NR > 1 )
			{
	 	 	$2 = "1"  ;
 			print($2)
 	 		print(" |fe")
			print(" c1_") ; print $3
			print(" banner_pos_") ; print $4
			print(" |site")
			print(" site_id_") ; print $5
			print(" site_domain_") ; print $6
			print(" site_category_") ; print $7
			print(" |app")
			print(" app_id_"); print $8
			print(" app_domain_") ; print $9
			print(" app_category_") ; print($10)
			print(" |device")
			print(" device_id_") ; print($11)
			print(" device_ip_"); print($12)
			print(" device_model_") ; print($13)
			print(" device_type_") ; print($14)
			print(" device_conn_type_") ; print($15)
			print(" |others")
			print(" c14_") ; print($16)
			print(" c15_") ; print($17)
			print(" c16_") ; print($18)
			print(" c17_") ; print($19)
			print(" c18_") ; print($20)
			print(" c19_") ; print($21)
			print(" c20_") ; print($22)
			print(" c21_") ; print($23)
 			print("\n")
 			} } 

Next, we use the model prepared earlier to make predictions for test.vw. The vw command is:

$vw -d test.vw -t -i neural_model --link=logistic -p probabilities.txt

The argument '-t' indicates that we are feeding a test file and that the class field is to be ignored. '-i' specifies the model file. We use '--link=logistic' to get probabilities; use '--link=glf1' to get output in the range [-1,1]. The output file is ‘probabilities.txt’.

Kaggle requires that we submit results in the format ‘id,probability’ (without headers). A sample submission file is on the site which has all the IDs (same as in test.csv). We read this file in R and overwrite its second column with our predicted probabilities. R code for this is as below:

p<-read.table("probabilities.txt",header=FALSE)
samplesubmission<-read.csv("sampleSubmission.csv",header=TRUE,colClasses=c('character','numeric') )
samplesubmission[,2]<-p[1]
head(samplesubmission)
write.csv(samplesubmission,"neural_result.csv",quote=FALSE,row.names=FALSE)

The file neural_result.csv (either as it is or zipped) can be submitted to Kaggle. Even though the competition is over, one still gets a score. For this model the score was 0.3976030.

Kaggle score of 0.3976030: Post deadline

Let us now once again review our model options. We selected the neural network option with just three neurons. This choice, it is believed, gives good results. But, it appears, we could have done almost as well without a neural network, using just the following learning model:

$vw -d train.vw --cache_file  neural  --passes 5 -q sd  -q ad  -q do -q fd --binary -f logistic_model  --loss_function=logistic

With the above we get a Kaggle score of 0.3994788. Thus, the '--nn 3' option did make an improvement, but not a dramatic one. We did not test the '--nn' option with a larger number of neurons. Increasing the number of '--passes' from 5 to 10 did not affect results; the actual number of '--passes' used was 8. I may mention that VW’s default learning algorithm is online gradient descent.

We initially used an ‘Importance‘ of 2 for clicks (label ‘1’) while converting the ‘train.csv’ file to ‘train.vw’ (see the conversion code earlier). We did this so that the learner would treat clicks as important events and not merely as noise (as such events were few).

We raised the ‘Importance’ to 3; the score degraded. We then changed the Importance to just 1, i.e. treated click events on par with non-click events. The Kaggle score was still 0.3976030. This meant that roughly 20% clicks were sufficient for VW to make good enough predictions. This finishes our experiments with Vowpal Wabbit on CTR predictions.

Install Vowpal Wabbit on CentOS machine

February 20, 2015

Vowpal Wabbit is a fast machine learning engine. Here are the steps to install Vowpal Wabbit on a CentOS machine; mine is CentOS 6.5. The problem with some CentOS machines is that they ship with older versions of g++, while Vowpal Wabbit requires a newer version.

=> Step 1: Install cmake, libtool and git
#  yum install  git   cmake   libtool

=> Step 2: Install boost and boost-devel
# yum  -y  install boost
# yum  -y  install boost-devel

=> Step 3: Check g++ version. It should be 4.8+
$ g++  --version

If the g++ version is not 4.8+, install devtoolset as follows:

First create repo file:

# gedit  /etc/yum.repos.d/DevToolset.repo

And write to this file the following five lines and save:

[DevToolset-2]
name=RedHat DevToolset v2 $releasever - $basearch
baseurl=http://puias.princeton.edu/data/puias/DevToolset/6.5/x86_64/
enabled=1
gpgcheck=0

Next, download and install higher version of g++ :

# yum install devtoolset-2-gcc devtoolset-2-binutils devtoolset-2-gcc-c++

The just-installed g++ should be at /opt/rh/devtoolset-2/root/usr/bin/g++. Rename your existing g++ in the /usr/bin/ folder to, say, /usr/bin/oldg++. Then create a link in /usr/bin/ to the just-installed g++:

# ln  -s  /opt/rh/devtoolset-2/root/usr/bin/g++   /usr/bin/g++

=> Step 4: Download vowpal wabbit from github to a local user's home folder (not necessarily root), as follows. A folder vowpal_wabbit will be created to contain all the downloaded files.

$ git clone https://github.com/JohnLangford/vowpal_wabbit.git

=> Step 5: Install vowpal wabbit as:

$ cd  /home/localuser/vowpal_wabbit/
$ ./autogen.sh
$ ./configure
$ make
# make install      (as root)

This should finish the installation of Vowpal Wabbit. If any required package is still missing, the ./configure command above will point out the deficiency.

=> Step 6: Test your installation as:
$ vw   -h

This finishes the installation of Vowpal Wabbit on a CentOS machine. By and large the installation is smooth. You can now remove the g++ link just created in /usr/bin and restore the earlier g++ file back in /usr/bin.

We have used Vowpal Wabbit in Click Through Rate prediction and in Handwritten digits recognition. See this blog and also this blog.

K-Nearest Neighbor classification using python

January 21, 2015

A number of open-source communities are using python to make available artificial intelligence and machine learning related packages and libraries. A list of communities and packages is available at PythonForArtificialIntelligence. In this blog I will use libraries from scikit-learn.

About sklearn, classes & datasets

Project scikit-learn is a Machine Learning project in Python. It has a good collection of algorithms for well known data-mining and data-analysis jobs such as Classification, Regression, Clustering, Dimensionality reduction and Model Selection. These algorithms are built on a stack of NumPy, the SciPy library, and matplotlib. All of these are part of the SciPy.org ecosystem.

NumPy serves as a base package for scientific and mathematical computing. It has very powerful and efficient N-dimensional array manipulation capabilities; you can read it as Numeric arrays in Python. The SciPy library is a collection of scientific and numerical computing tools. Numerical computing uses an algorithmic, numerical-approximation approach (rather than equation solving) for mathematical problems. For some complex mathematical problems, numerical solutions are easier, faster and sufficiently accurate. For example, it is better to use the algorithmic approach of gradient descent to discover the minimum of a multidimensional surface rather than first describe the surface mathematically and then solve a complex set of equations to find its global or local minima. Matplotlib is a python 2-D plotting library for rendering high quality graphs.

Various python distributions are available that can install SciPy stack or SciPy.org ecosystem. A list is available here. One easy way to install the SciPy stack is through anaconda python distribution. Anaconda also installs IPython and pandas. Installation can be on Windows, Linux or Mac.

sklearn comes with a handy dataset module, ‘sklearn.datasets‘. The module has a few pre-installed small datasets and accompanying classes to load them. It has classes to fetch some popular datasets from repositories such as mldata.org. It also has utilities to generate fictitious datasets for practice.

sklearn.neighbors

The ‘sklearn.neighbors‘ module provides unsupervised and supervised neighbors-based learning methods, among them classification and regression tools. You can have a quick look at all the classes for classification and other tasks here; in fact this is a one-page brief reference for all the machine learning modules and classes under sklearn, and I strongly recommend that you visit this page if you are seriously learning python for machine learning with sklearn. Specifically, there are three classes that implement a supervised classification algorithm based on the nearest-neighbor approach, namely KNeighborsClassifier, RadiusNeighborsClassifier and the NearestCentroid classifier. KNeighborsClassifier implements classification based on voting by the nearest k neighbors of a target point, t, while RadiusNeighborsClassifier implements classification based on all neighborhood points within a fixed radius, r, of the target point, t. In the NearestCentroid classifier, each class is represented by the centroid of its members; the target point becomes a member of the class whose centroid is nearest to it. The NearestCentroid algorithm is the simplest of the three and has essentially no parameters to select. Its results can be taken as a benchmark for evaluation purposes. Brief details about each of these three are given here.
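As a quick illustration of the simplest of the three, here is a minimal NearestCentroid sketch on sklearn's bundled iris data (just an illustration; we return to the iris data below):

from sklearn.datasets import load_iris
from sklearn.neighbors import NearestCentroid

iris = load_iris()
clf = NearestCentroid()                    # defaults are fine for a baseline
clf.fit(iris.data, iris.target)
print(clf.score(iris.data, iris.target))   # accuracy on the training data itself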

KNeighborsClassifier

I have described K-Nearest Neighbor algorithm in my earlier blog. You can read about it also in Wikipedia. In this blog I will use KNeighborsClassifier class for classification purposes. Parameters (and their default values) in this class are as follows:

KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None)

‘n_neighbors‘ is the number of neighbors that will vote for the class of the target point; the default number is 5. An odd number is preferred to avoid ties. The ‘weights‘ parameter has two choices: ‘uniform‘ and ‘distance‘. With ‘uniform‘ weights, each of the k neighbors has an equal vote whatever its distance from the target point. If the weight is ‘distance‘ then the voting weight varies with the inverse of distance; points nearest to the target point have greater influence than those farther away. The ‘algorithm‘ parameter selects the indexing data structure used to speed up the neighborhood search; a value of ‘auto‘ leaves it to the algorithm to make the best choice among the three. I have described the three algorithms, brute, kd_tree and ball_tree, in my earlier blog. The ‘leaf_size‘ parameter is the size of a leaf in kd_tree or ball_tree; the larger the size, the faster the initial indexing structure is built, but at the cost of slower classification of the target point. The ‘metric‘ parameter decides how distances are calculated in space. One familiar way is euclidean distance, but in some cases other measures such as Manhattan distance are also used. A general formulation of the distance metric is the ‘minkowski’ distance: when parameter ‘p‘ is 2 it is the same as euclidean distance, and when ‘p‘ is 1 it is Manhattan distance. You can read more about them here. The last parameter, ‘metric_params‘, is for providing any additional arguments to the metric function.
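For reference, a small sketch of the minkowski distance, showing that p=1 reduces to Manhattan distance and p=2 to euclidean distance:

import numpy as np

def minkowski(x, y, p):
    """Minkowski distance: (sum over i of |x_i - y_i|**p) ** (1/p)."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

a, b = [0, 0], [3, 4]
print(minkowski(a, b, 1))   # 7.0  -> Manhattan distance
print(minkowski(a, b, 2))   # 5.0  -> euclidean distance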

Thus, the class KNeighborsClassifier can be used without explicitly specifying the value of any parameter; default values are already supplied. This makes it easy to use for initial learning and then to add parameter values one by one, which is precisely what I will do. We will use the iris dataset for our demo program. The iris dataset is available in sklearn itself, or you can download it from the UCI machine learning repository from here. A simple program is as below:

import numpy as np
from sklearn import neighbors, datasets		

# Load iris data from 'datasets module'
iris = datasets.load_iris()
#   Get data-records and record-labels in arrays X and y
X=iris.data
y=iris.target
# Create an instance of KNeighborsClassifier and then fit training data
clf = neighbors.KNeighborsClassifier()
clf.fit(X, y)
# Make class predictions for all observations in X
Z = clf.predict(X)
# Compare predicted class labels with actual class labels
accuracy=clf.score(X,y)
print ("Predicted model accuracy: "+ str(accuracy))
# Add a row of predicted classes to y-array for ease of comparison
A = np.vstack([y, Z])
print(A)

Instead, if you so like, you can download the iris dataset from the UCI repository. The iris data appears as follows; there are three classes of flowers: Iris-setosa, Iris-versicolor and Iris-virginica.

sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica

We will read data using pandas and analyse it. The code for this is as follows.

import numpy as np
from sklearn import neighbors
import pandas as pd    				

#   Read from downloaded iris data file
df = pd.read_table('/home/ashokharnal/Documents/iris.data',sep=",")

# Separate four data attributes and class data (the 5th attribute)
#  Slice data-frame column wise. When slicing the data frame using iloc,
#    the start bound (0) is included, while the upper bound (4) is excluded.
X  =  df.iloc[:,0:4]	# X includes columns 0,1,2,3
y  =  df['class']	# Get last column

clf = neighbors.KNeighborsClassifier()
clf.fit(X, y)
Z = clf.predict(X)
accuracy=clf.score(X,y)
print ("Predicted model accuracy: "+ str(accuracy))
# Type of Z is numpy ndarray. Add, Z, to iris data frame as last column
df['Z']=Z
# Compare two classes: actual and predicted
df.iloc[:,4:6]

Let us add a bit of complexity to the above code. We will use only two of the iris data attributes for predicting flower classes. On the X-Y plane we plot the two attributes; divide the area (the positive quadrant) using a grid; evaluate the class of each point on the grid; and color that cell with one of three colors to display its class. Grid points adjacent to training points of a class get that class's color, while points farther away may get a different color. This way we can demarcate the class boundaries. The following code does this.

import numpy as np			
import pylab as pl				# for basic plotting jobs 
from matplotlib.colors import ListedColormap	# for mapping colours to an array of values 
from sklearn import neighbors
import pandas as pd    				

df = pd.read_table('/home/ashok/Documents/iris.data',sep=",")
#   Recode three class values: "Iris-setosa" as 0, Iris-versicolor as 1, Iris-virginica as 2
df['class'].replace('Iris-virginica',2,inplace=True)
df['class'].replace('Iris-versicolor',1,inplace=True)
df['class'].replace('Iris-setosa',0,inplace=True)

#   Slice df vertically to include only cols 0 & 1
X = df.iloc[:, :2]	# If X were a numpy array, this would be X = X[:, :2]
y  =  df['class']	# Get last column

# Create list of three colors (corresponding to three class values)
light_colors =  ListedColormap(['blue', 'c', 'g'])	# 'c' for cyan and 'g' for green
bold_colors  =  ListedColormap(['r', 'k', 'yellow'])	# 'r' for red  and 'k' for black
light_colors.colors                                     # Just check colors in the list

# Modelling 
clf = neighbors.KNeighborsClassifier()
clf.fit(X, y)

# Fix four corners of grid boundaries: Corners are
#   as per min and max values of each of the two Iris attributes
x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1

# Create a mesh with bottom-left corner: (x_min,y_min) & 
#   top-right corner: (x_max,y_max). 
#     cell width & height: h. (Larger h leads to coarser class-boundaries)
#      'meshgrid' is very useful to evaluate functions on a grid.
h = 0.02  
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# You could display the grid with the following code
#   pl.plot(xx,yy,'r.')
#   pl.show()

# Use either of the following functions to flatten the multidimensional arrays
#   of xx & yy and make class predictions on those coordinates
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z1= clf.predict(np.c_[xx.flatten(), yy.flatten()])
Z2= clf.predict(np.c_[xx.reshape(len(xx)*len(xx[0]),1), yy.reshape(len(yy)*len(yy[0]),1)])

# Print Z values to understand the output
print(Z)
print(Z1)
print(Z2)

# Reshape Z as per xx array
Z = Z.reshape(xx.shape)
Z1=Z1.reshape(xx.shape)
Z2=Z2.reshape(xx.shape)

# Create figure in memory with default parameter values
pl.figure()

# Color every cell defined by arrays xx & yy as per values of Z
#   There are a total of xx*yy cells. Z is also of same size.
#     Each cell is colored in either of three colors as per corresponding value of Z (class).
pl.pcolormesh(xx, yy, Z,  cmap=light_colors)    # Invoke only one of the functions
pl.pcolormesh(xx, yy, Z1, cmap=light_colors)
pl.pcolormesh(xx, yy, Z2, cmap=light_colors)

# Plot Iris attribute coordinates on demarcated boundaries 
pl.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=bold_colors)
pl.title("Iris-classification (weights = '%s')"  % ('uniform'))
pl.axis('tight')
pl.show()

We will complicate the code further. We will now vary all parameters in class KNeighborsClassifier. Parameter values are as below:

n_neighbors:   Values 3 or 15
weights:       Values 'uniform' or 'distance'
algorithm:     Values 'auto', 'brute', 'ball_tree' or 'kd_tree'
p:             Values 1 or 2

We have altogether 2*2*4*2 = 32 choices. We use ‘for’ loop to cycle through these 32 combinations, one by one, evaluate accuracy and draw decision boundaries for each. The following code performs this.

import numpy as np
import pylab as pl
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
 
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
h = .01
# Create color maps from a list of colors
light_colors = ListedColormap(['blue', 'c', 'g'])
bold_colors  =  ListedColormap(['r', 'k', 'yellow'])
 
# Cycle through all 32 combinations of the four parameters
for n_neighbors in [3,15]:
    for distancemetric in [1,2]:
        for algorithms in ['auto', 'ball_tree', 'kd_tree', 'brute']:
            for weights in ['uniform', 'distance']:
                if (distancemetric == 1):
                    d_metric="Manhattan distance"
                else:
                    d_metric="Euclidean distance"
 
                clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights,algorithm=algorithms,p=distancemetric )
                clf.fit(X, y)
 
                x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
                y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
                xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
                Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
                accuracy=clf.score(X,y)
                print("No of neighbors: "+str(n_neighbors)+", Distance metric: "+d_metric+", Algorithm is: " + algorithms +  ", weights: "+ weights+ ", Accuracy is: "+ str(accuracy))
     
                Z = Z.reshape(xx.shape)
                pl.figure()
                pl.pcolormesh(xx, yy, Z, cmap=light_colors )
                # Plot also the data points
                pl.scatter(X[:, 0], X[:, 1], c=y, cmap=bold_colors)
                pl.title("3-Class classification (k = %i, weights = '%s', algorithms ='%s',distance_metric= '%s')"  % (n_neighbors, weights,algorithms,d_metric))
                pl.axis('tight')               
pl.show()

In this blog, I have picked up example code liberally from the sklearn site here. However, as that example is a little difficult for a novice to understand, I have tried to explain it here in steps, with a bit of amplification. Hope it is useful. Bye!

A working example of K-d tree formation and K-Nearest Neighbor algorithms

January 20, 2015

In this blog and next few blogs, I will discuss implementation of k-nearest neighbor (k-NN) algorithm using python libraries. In this blog my objective is to explain in a simple manner how to speed up classification of a data-record by creation of data-structures in K-nearest neighbor algorithm. I will explain two indexing structures K-d tree and ball-tree, both of which are available in python libraries.

Generic K-Nearest Neighbor algorithm
K-NN is a classification algorithm and conceptually one of the simplest to understand. It is also called a ‘Lazy Learner’, as against an ‘Eager Learner’. Most classification algorithms are eager learners: there is a set of training data with example classifications, the training data is used to construct a classification model, and the model is evaluated on test data where the classifications are known. If the evaluated results are satisfactory, the final model is then used for making predictions on data with unknown classifications. Eager learners have, therefore, already done most of their job of model formulation beforehand. A lazy learner, on the other hand, does not build any model beforehand; it waits for the unclassified data and then works its way through the algorithm to make a classification prediction. Lazy learners are, therefore, time consuming; each time a prediction is to be made, all the model-building effort has to be performed again.

In the k-nearest neighbor algorithm, the example data is first plotted in an n-dimensional space, where ‘n’ is the number of data attributes. Each point in this n-dimensional space is labeled with its class value. To discover the classification of an unclassified data point, the point is plotted in this n-dimensional space and the class labels of the nearest k data points are noted. Generally k is an odd number. The class that occurs the maximum number of times among the k nearest data points is taken as the class of the new data point; that is, the decision is by voting of the k neighboring points. One big advantage of this generic K-Nearest Neighbor algorithm is that it is amenable to parallel operation.

In the generic k-NN model, each time a prediction is to be made for a data point, this point's distance from all other points must first be calculated, and only then can the nearest k points be discovered for voting. This approach is also known as the brute-force approach. When the volume of data is huge and its dimension is also very large, say in hundreds or thousands, these repeated distance calculations can be very tedious and time consuming. To speed up this process, and to avoid measuring distances from all the points in the data set, some pre-processing of the training data is done. This pre-processing helps to search only points which are likely to be in the neighborhood.
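A minimal sketch of this brute-force voting idea on toy data (plain numpy; the sklearn classes discussed in the next blog do the same thing far more efficiently):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Brute-force k-NN: compute every distance, then vote among the k nearest."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))    # euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                          # indices of the k nearest points
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority class among them

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([2, 2])))   # 0
print(knn_predict(X, y, np.array([7, 8])))   # 1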

K-d tree formation — Speeding up K-NN
One way is to construct a sorted hierarchical data structure called k-d tree or k-dimensional tree. A k-dimensional tree is a binary tree. We illustrate its process of formation below through a working example for easy understanding.

Consider a three dimensional (training) data set shown in Table 0, below left. For convenience of representation we have not shown the fourth column containing class labels for each data record. We have three attributes ‘a’, ‘b’ and ‘c’. Among the three, attribute ‘b’ has the greatest variance. We sort the data set on this attribute (Table 1) and then divide it into two parts at the median.

    Table 0  			    Table 1		
  Unsorted data			 Sort on column b		
a	b	c		a	b	c
22	38	21		6	2	9
4	8	6		4	8	6
2	14	3		2	14	3
8	20	12		8	20	12
10	26	18		10	26	18
12	32	15		12	32	15<---
18	56	33		22	38	21
16	44	27		16	44	27
20	50	24		20	50	24
14	62	30		18	56	33
6	2	9		14	62	30

The median is at (12,32,15). Dividing Table 1 into two parts at the median gives us two tables, Table 2 and Table 3 as below. Next, from among the remaining (two) attributes, we select that dimension that has the greatest variance. This dimension is ‘c’. Again, we sort the two tables on this dimension and then break them at the respective medians.

          Break Table 1 on median (12,32,15)						
        >			       <	
     Table 2			     Table 3		
a	b	c		a	b	c
22	38	21		6	2	9
16	44	27		4	8	6
20	50	24		2	14	3
18	56	33		8	20	12
14	62	30		10	26	18

Tables sorted on column C are as below.

Sort Table 2 on column c 	Sort Table 3 on column c		
     Table 4			    Table 5		
a	b	c		a	b	c
22	38	21		2	14	3
20	50	24		4	8	6
16	44	27<---		6	2	9<---
14	62	30		8	20	12
18	56	33		10	26	18

Table 4 is next split at its median, (16,44,27), and Table 5 is split at its median, (6,2,9), as below.

        Break Table 4 on median  (16,44,27)						
         >				<
14	62	30		22	38	21
18	56	33		20	50	24
             Break Table 5 on median (6,2,9)
        >				<		
8	20	12		2	14	3
10	26	18		4	8	6

We now have four tables. If we decide to end the splitting process here, these four tables are the tree leaves. Otherwise, we would next split by sorting on column ‘a’ (and then again on ‘b’, ‘c’, …).
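The construction just walked through can be sketched in a few lines of Python. This is only an illustration of the rule used above (split on a not-yet-used column with the greatest variance, at its median), not the exact algorithm of any particular library; the data are the Table 0 values:

import numpy as np

def build_kdtree(points, used=(), leaf_size=2):
    """Split on the not-yet-used column with the greatest variance, at its median
    (the same rule as in the worked example above)."""
    points = np.asarray(points)
    if len(points) <= leaf_size:
        return {"leaf": points.tolist()}
    if len(used) == points.shape[1]:              # every column used once: start over
        used = ()
    variance = points.var(axis=0)
    variance[list(used)] = -1.0                   # exclude columns already split on
    dim = int(np.argmax(variance))
    points = points[points[:, dim].argsort()]     # sort on the chosen column
    mid = len(points) // 2                        # position of the median row
    return {"dim": dim,
            "median": points[mid].tolist(),
            "left": build_kdtree(points[:mid], used + (dim,), leaf_size),
            "right": build_kdtree(points[mid:], used + (dim,), leaf_size)}

data = [[22, 38, 21], [4, 8, 6], [2, 14, 3], [8, 20, 12], [10, 26, 18], [12, 32, 15],
        [18, 56, 33], [16, 44, 27], [20, 50, 24], [14, 62, 30], [6, 2, 9]]
tree = build_kdtree(data)
print(tree["dim"], tree["median"])                      # 1 [12, 32, 15]  (column b)
print(tree["right"]["dim"], tree["right"]["median"])    # 2 [16, 44, 27]  (column c)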

Once this data structure is created, it is easy to find the (approximate) neighborhood of any point. For example, to find the neighborhood of the point (9,25,16), we move down the hierarchy, going left or right at each node. First, at the root node, we compare 25 with the value at the root (from column b); at the next node we compare 16 (from column c); and lastly 9. The data at the leaf are possible (but not necessarily the actual) nearest points. Generally, distances from the points in the table on the other side of a node are also calculated in order to discover the nearest points, and one may also move a step up the tree. Incidentally, the median points ((12,32,15), for example) are also made part of either the left or the right sub-tree.

Ball-tree data structure
Another data structure to speed up the discovery of neighborhood points is the ball-tree data structure. The ball tree is very efficient, especially when the number of dimensions is very large. A ball tree is also a binary tree with a hierarchical (binary) structure. To start with, two clusters (each resembling a ball) are created; as this is a multidimensional space, each ball may more appropriately be called a hypersphere. Any point in the n-dimensional space will belong to one cluster but not to both: it belongs to the cluster whose centroid is nearer to it. If the point is equidistant from the centroids of both balls, it may be included in either one of them. The two (virtual) hyperspheres may intersect, but every point belongs to only one of the two. Next, each of the balls is again subdivided into two sub-clusters, each again resembling a ball; in these sub-clusters there are again two centroids, and the membership of a point is decided by its distance from the centroid of the sub-cluster. We again subdivide each of these sub-sub-balls, and so on up to a certain depth.

An unclassified (target) point must fall within one of the nested balls. Points within this nested ball are expected to be nearest to the target point, though points in other nearby balls (or enveloping balls) may also be near it (the target point may, for example, lie at the boundary of one of the balls). Nevertheless, one need not calculate the distance of this unclassified point from all the points in the n-dimensional space, and this speeds up the classification process. Ball-tree formation initially requires a lot of time and memory, but once the nested hyperspheres are created and placed in memory, the discovery of nearest points becomes easier.
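sklearn exposes this structure directly as sklearn.neighbors.BallTree (the classifier classes discussed in the next blog can use it internally). A minimal usage sketch on random data (the numbers are arbitrary):

import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X = rng.rand(1000, 50)               # 1000 points in a 50-dimensional space

tree = BallTree(X, leaf_size=40)     # build the nested-hypersphere index once
dist, ind = tree.query(X[:1], k=5)   # the 5 nearest neighbours of the first point
print(ind)                           # indices of those neighbours (the point itself comes first)
print(dist)                          # their distances (the first one is 0.0)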

In my next blog, I have given examples of how to use classes in sklearn to perform K-NN classification.

Alternating Least Square Matrix factorization for Recommender engine using Mahout on hadoop–Part II

December 29, 2014

In our last blog we used Mahout’s ALS algorithm for discovering top-N recommendations for a user. In this blog we intend to find the predicted rating for any (userid, movieid) combination, not necessarily in the top-N. We will continue our work from where we concluded in the last blog. We are given a text file (‘uexp.test’) as below:

ID,user,movie
110,1,6
111,1,10
112,1,12
113,1,14
114,1,17
115,2,13
116,2,50

Our task is to create a text file with just ID and rating, something as below:

ID,rating
110,4
111,4
112,5
113,1
114,2
115,3
116,4

Specifically, from our work in the last blog, we will be using the results of command ‘mahout parallelALS ‘. You may like to peruse the earlier blog before proceeding with this one.

# Build model from command line.
mahout parallelALS --input $hdfs_movie_file \
                   --output $out_folder \
                   --lambda 0.1 \
                   --implicitFeedback false \
                   --alpha 0.8 \
                   --numFeatures 15 \
                   --numIterations 100 \
                   --tempDir $temp

##Parameters in mahout are
# lambda            Regularization parameter to avoid over-fitting
# implicitFeedback  User's preference is implied by his purchases (true) else false
# alpha             How confident are we of implicit feedback (used only when implicitFeedback is true)
#                   Of course, in the above case it is 'false' and hence alpha is not used.
# numFeatures       No of user and item features
# numIterations     No of iterations
# tempDir           Location of temporary working files

The outputs of this command are as below. All files in folders are sequence files:
1. user-feature matrix files in folder, $out_folder/U/, in hadoop
2. item-feature matrix files in folder, $out_folder/M/, in hadoop
3. user’s already known ratings for movies in folder, $out_folder/userRatings/

Steps that we follow are these:
1. Dump contents of U to see how many rows and columns this matrix contains
2. Dump contents of M to see how many rows and columns this matrix contains
3. Transpose user-feature matrix
4. Transpose item-feature matrix
5. Multiply the two transposed matrices using mahout’s command line tool to get user-item matrix. Mahout’s multiplication tool multiplies two matrices A and B as transpose(A) .B and not as A.B. Hence step 3 above.
6. Dump the results of multiplication to text files on local file system
7. Use bash-script to extract ratings for relevant (userid, movieid)
8. Write rating to a new file with ID and rating as mentioned earlier.

We use ‘mahout seqdumper‘ command to dump contents of two files under M to M0.txt and M1.txt.

# Dump M to see how many items are there
mahout seqdumper -i /user/ashokharnal/uexp1.out/M/part-m-00000 \
                 -o /home/ashokharnal/M0.txt
mahout seqdumper -i /user/ashokharnal/uexp1.out/M/part-m-00001 \
                 -o /home/ashokharnal/M1.txt

Open the two files, M0.txt and M1.txt, and read the count of keys at the end; this number is the number of rows. The number of values per key gives the number of columns. For example, sample (truncated) contents of ‘M0.txt’ are as follows:

Input Path: /user/ashokharnal/uexp1.out/M/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 1: Value: {0:1.974276012,1:0.0831050238,2:0.273167938,3:0.1869148199,4:-0.1584698196,5:-0.1129896600,6:-0.237190063,7:-0.155191340,8:0.388374577,9:0.2465730276,10:0.00144229714,11:0.515006337,12:0.1919460605,13:0.1356261162,14:0.3860959700}
Key: 3: Value: {0:1.4980332750521874,1:0.13121572246681273,2:0.4705798204432542,3:-0.290794568454,4:0.76276377,5:-0.264316266,6:0.2338155050,7:0.328942047,8:-0.105526130,9:-0.063718423,10:0.219792438,11:-0.56082225,12:0.245130819,13:-0.32558446,14:0.67558142}
 |
 |
Count: 826

Similarly, in file ‘M1.txt’ the Count is 824. Thus, the total number of rows is 826+824=1650, and the number of columns (values indexed 0 to 14) is 15. The dimensions of the item-feature matrix are therefore 1650 X 15; 1650 is the number of items. In the same manner, dump the contents of U as below:

mahout seqdumper -i /user/ashokharnal/uexp1.out/U/part-m-00000  \
                 -o /home/ashokharnal/U0.txt
mahout seqdumper -i /user/ashokharnal/uexp1.out/U/part-m-00001  \
                 -o /home/ashokharnal/U1.txt

Dimensions of user-feature matrix are (471+472) X 15 or 943 X 15. 943 is the number of users. Our objective is to get user-item matrix, V, as below:

V = U. transpose(M)

However, mahout's matrix multiplication tool, given matrices A and B, computes transpose(A).B rather than A.B. Since transpose(transpose(U)) = U, feeding it transpose(U) and transpose(M) yields exactly U.transpose(M). We therefore first transpose U as follows:

mahout transpose  -nr 943 -nc 15  \
                  -i /user/ashokharnal/uexp1.out/U/  \
                  --tempDir /tmp/transposeU

The transposed matrix folder will be generated in the same place where folder, U, is with a name something like ‘transpose-33’. In our case the transposed matrix location is: /user/ashokharnal/uexp1.out/transpose-33/. Similarly, transpose M with the command:

mahout transpose  -nr 1650 -nc 15   \
                  -i /user/ashokharnal/uexp1.out/M/  \
                  --tempDir /tmp/transposeM

The transposed matrix is created at: /user/ashokharnal/uexp1.out/transpose-195/.
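Why all this transposing gives the right answer is easy to verify with a small numpy sketch (random matrices of the same shapes as U and M): feeding transpose(U) and transpose(M) to a tool that computes transpose(A).B yields exactly U.transpose(M).

import numpy as np

rng = np.random.RandomState(0)
U = rng.rand(943, 15)      # user-feature matrix  (943 users  X 15 features)
M = rng.rand(1650, 15)     # item-feature matrix  (1650 items X 15 features)

V = U.dot(M.T)                      # what we want: the user-item matrix, U . transpose(M)
V_via_tool = (U.T).T.dot(M.T)       # what the tool computes when fed A = transpose(U), B = transpose(M)
print(V.shape)                      # (943, 1650)
print(np.allclose(V, V_via_tool))   # True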

We now have to multiply the two transposed matrices. We use ‘mahout matrixmult‘ utility as follows. (To get help on it, run the command: ‘mahout matrixmult -h‘ )

mahout matrixmult -nra 15 -nca 943 -nrb 15  -ncb 1650  \
                  -ia /user/ashokharnal/uexp1.out/transpose-33/  \
                  -ib  /user/ashokharnal/uexp1.out/transpose-195  \
                  --tempDir /tmp/useless

The matrix product folder is generated under /tmp folder with name something like: /tmp/productWith-35/. To work with this user-item matrix using bash, we dump the (sequence) product files to local file system, as follows:

mahout seqdumper -i /tmp/productWith-35/part-00000  \
                 -o /home/ashokharnal/p0.txt
mahout seqdumper -i /tmp/productWith-35/part-00001  \
                 -o /home/ashokharnal/p1.txt

Truncated output for one key (user) looks as below. You can interpret it as follows: for userid 4 and movieid 5, the rating is 3.5528462109, which we round up to 4.

Key: 4: Value: {1:3.9332423190,2:3.3451704447,3:2.3199345720,4:3.779028659,5:3.5528462109,6:4.4076660033, ..... ......... ,1680:1.4181067023,1681:2.4794630145,1682:3.115888046}

Looking at the values above, you might think that this matrix has values 1 to 1682, i.e. 1682 columns. This is not so, as many ‘itemids’ are missing; the number of columns is 1650 (I have not counted, but you are free to do so). Adding up the ‘Count’ values written at the end of the two files (‘p0.txt’ and ‘p1.txt’), you would find the number of rows to be 943. To make our search for a particular userid (key) and a particular itemid (value) easier, we concatenate the two files as below:

# First remove the first two informative lines and the last (Count) line from the two files
sed '1,2d' p0.txt | sed '$d' > p00.txt
sed '1,2d' p1.txt | sed '$d' > p01.txt
# Append the first to second
cat p00.txt >> p01.txt

We now have the complete V matrix in file ‘p01.txt’, but in (Key,Value) format. We have a ‘uexp.test’ file (see the sample at the top of this blog); before using this file we remove its first line (the header). We use the following shell script to read userids and movieids from this file and then extract the corresponding rating value from ‘p01.txt’. The bash script is heavily commented for easy understanding.

#!/bin/bash
##############
# Use as:
# cat uexp.test | ./extractandwrite.sh
# Format of uexp.test: 'ID,userid,movieid' but no header
##############

## Some constants. File 'p01.txt' contains matrix-product
input_file="/home/ashokharnal/p01.txt"
output_file="/home/ashokharnal/submitoutput.txt"

# Begin reading file 'uexp.test' line by line
while read line
do
    #1. Extract ID, userid and movieid from 1st, 2nd and 3rd fields
    ID=$(echo $line | awk 'BEGIN {FS=","} ; {print $1}' )
    userid=$(echo $line | awk 'BEGIN {FS=","} ; {print $2}' )
    movieid=$(echo $line | awk 'BEGIN {FS=","} ;{print $3}' )

    #2. Feed file 'p01.txt' to awk. Each line has four space-separated fields, as:
    #    'Key: 4: Value: {1:3.9332423190590364,2:3.3451704447958877}'
    #     For a userid ('4' above), assign 4th field to user_preferences
    user_preferences=$(awk -v  myvar="$userid:"   ' { if ($2==myvar) print $4  } ' $input_file)

    #3. Look now for movieid within user_preferences and extract colon separated rating.
    #    Shell utilities 'tr' and 'awk' are used. 'tr' translates three
    #      symbols ',' '{' & '}' (i.e. comma and two curly brackets) with newline ('\n').
    #       Consequently single line of variable user_preferences gets split into multiple lines.
    #          Each split-line contains a pair like 23:2.33456 i.e. just movieid:rating.
    #            Pipe each of the pairs into awk and match 1st field with movieid.
    #              When match occurs, awk extracts the 2nd field i.e. rating.
    #                Assign extracted value to 'dec_rating'
    dec_rating=$(echo $user_preferences |  tr , '\n' | tr { '\n' | tr } '\n' | awk -v myvar="$movieid" -F: '$1 == myvar {print $2}' )

    #4. Raise dec_rating value to ceiling integer. 5.001 becomes 6
    ceil_rating=$(perl -w -e "use POSIX; print ceil($dec_rating/1.0), qq{\n}")

    #4a. Limit max-rating to 5 and minimum to 1
    lt_rating=$(echo $ceil_rating | awk '$0<1{$0=1}$0>5{$0=5}1')

    #5. Concatenate lt_rating to ID
    revisedline="$ID,$lt_rating"

    #5a. Append it to output file
    echo $revisedline >> $output_file
done
echo "Output in file: "$output_file

As mentioned in the bash code, you run the above bash script as below; the output is appended to the output_file, ‘/home/ashokharnal/submitoutput.txt’.

# cd to where the two files are
cat uexp.test | ./extractandwrite.sh

This finishes our work.

Alternating Least Square Matrix factorization for Recommender engine using Mahout on hadoop–Part I

December 29, 2014

For a conceptual explanation of how matrix factorization applies to recommender problems, please see my earlier blog. Mahout offers the Alternating Least Squares (ALS) algorithm for matrix factorization. The user-item matrix, V, can be factorized into an m X k user-feature matrix, U, and an n X k item-feature matrix, M. Mahout's ALS recommender engine can be used for large data sets spread over many machines in an hdfs environment. It thus enjoys an advantage over techniques that require all data to fit into the RAM of a single machine.

Mahout’s ALS recommender has the further advantage of making recommendations when user-item preferences are not explicit but implicit. In an explicit feedback environment, a user explicitly states his rating for an item. In an implicit feedback environment, a user's preference for an item is gathered when, for example, he shows interest in an item by clicking relevant links, purchases the item, or searches for it. With explicit feedback the rating scale can have any range, say 1 to 5; with implicit feedback the rating is either 0 or 1, so the maximum rating is 1.

For our experiments, we will use the MovieLens data set ml-100k. The data set consists of 100,000 ratings from 943 users for 1682 movies on a scale of 1-5. Each user has rated at least 20 movies. We will work with ‘u1.base’ file in this data set. File, u1.base, has data in the following tab-separated format:

1	1	5	874965758
1	2	3	876893171
1	3	4	878542960
2	1	4	888550871
2	10	2	888551853
2	14	4	888551853
2	19	3	888550871
3	317	2	889237482
3	318	4	889237482
3	319	2	889237026
4	210	3	892003374
4	258	5	892001374
4	271	4	892001690

The first column is the userid, the second column is the movieid, the third is the rating and the last is the time stamp. We are not concerned with the time stamp here, so we will remove this column. The following awk script will do this for us:

awk '{print $1"\t"$2"\t"$3}' u1.base > uexp.base

We will upload the user-movie rating file ‘uexp.base’ to hadoop and then use mahout to first build the user-item rating matrix and then factorize it. ‘mahout parallelALS’ can be used for this matrix factorization; it uses the ALS technique. To see all the arguments to ‘mahout parallelALS’, on a command line issue the command:

mahout  parallelALS  -h

In our case, we have installed the Cloudera hadoop eco-system on our machine; Mahout gets installed automatically with Cloudera. Of course, some initial configuration does need to be done, but we assume that mahout is configured and working on your system.

The following code snippet builds the user-item matrix and factorizes it. The code is heavily commented to make it easy to understand.

#!/bin/bash
# We will use u1.base file which contains ratings data

#######Some constants##########
# Folder in local file system where user-rating file, u1.base, is placed
datadir="/home/ashokharnal/Documents/datasets/Movielens"
localfile=$datadir/"uexp.base"

# Folder in hadoop-filesystem to store user-rating file
ddir="/user/ashokharnal"
hdfs_movie_file=$ddir/"uexp.base"
# Folder where calculated factor matrices will be placed
out_folder=$ddir/"uexp.out"
# Temporary working folder in hdfs
temp="/tmp/itemRatings"
cd $datadir

# Remove time-stamps from user-rating file 
awk '{print $1"\t"$2"\t"$3}' u1.base > uexp.base

# Copy rating file from local filesystem to hadoop
hdfs dfs -put $localfile $ddir/
# Check if file-copying successful
hdfs dfs -ls $ddir/uexp.base

# Delete temporary working folder in hadoop, if it already exists
hdfs dfs -rm -r -f $temp

# Build model from command line.
mahout parallelALS --input $hdfs_movie_file \
                   --output $out_folder \
                   --lambda 0.1 \
                   --implicitFeedback false \
                   --alpha 0.8 \
                   --numFeatures 15 \
                   --numIterations 100 \
                   --tempDir $temp

##Parameters in mahout are
# lambda            Regularization parameter to avoid overfitting
# implicitFeedback  User's preference is implied by his purchases (true) else false
# alpha             How confident are we of implicit feedback (used only when implicitFeedback is true)
#                   Of course, in the above case it is 'false' and hence alpha is not used.
# numFeatures       No of user and item features
# numIterations     No of iterations 
# tempDir           Location of temporary working files 

The above script writes a user-feature matrix under the folder (on hadoop) /user/ashokharnal/uexp.out/U/ and an item-feature matrix under the folder /user/ashokharnal/uexp.out/M/. The matrix files under folders U and M are in (binary) sequence file format. There is another folder, /user/ashokharnal/uexp.out/userRatings/; the sequence files in this folder contain the already known user ratings for various movieids.

We can now use the user-feature matrix and item-feature matrix to calculate top-N recommendations for any user. For this purpose, userids need to be stored in sequence file format, where the userid is the key and a movieid (which will be ignored) is the value. We write the following text file:

1,6
2,13
3,245
4,50
46,682

In this file userids 1, 2, 3, 4 and 46 are of interest, while the movieids written against them are arbitrary numbers. We are interested in finding out top-N recommendations for these userids. We convert this file to a sequence file using the following Java code; there is presently no command line tool available from mahout to produce such a file. The code is heavily commented for ease of understanding. You can use the NetBeans IDE to write, build and run this code. See this link.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.math.Vector;

public class VectorsFileForRecommender
	{
	public static void main(String[] args) throws IOException
		{
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		String input;
                String output;
                String line;
                input = "/home/ashokharnal/keyvalue.txt";
                output = "/home/ashokharnal/part-0000";

                // Create a file reader object
		BufferedReader reader = new BufferedReader(new FileReader(input));
                // Create a SequenceFile writer object
		SequenceFile.Writer writer = new SequenceFile.Writer( fs,conf,new Path(output), IntWritable.class, VectorWritable.class);
		
                // Read lines of input files, one record at a time
		while ((line = reader.readLine()) != null)
                    {
                    String[] rec;                           // A string array. 
                    rec = line.split(",");                  // Split line at comma delimiter and fill the string array 
                    double[] d = new double[rec.length];    // A double array of dimension rec.length
                    d[0] = Double.parseDouble(rec[0]);      // Double conversion needed for creating vector 
                    d[1] = Double.parseDouble(rec[1]);

                    // We will print, per record, lots of outputs to bring clarity 
                    System.out.println("------------------");
                    System.out.println("rec array length: "+rec.length);
                    System.out.println("userid: "+rec[0]);
                    System.out.println("Movieid: "+Double.parseDouble(rec[1]));
                  
                    // Create a Random access sparse vector. A random access sparse vector
                    //  only stores non-zero values and any value can be accessed randomly
                    //    as against sequential access.
                    // Class, RandomAccessSparseVector, implements vector to 
                    //   store non-zero values of type doubles
                    // We may either create a RandomAccessSparseVector object of size just 1, as:
                    //   Vector vec = new RandomAccessSparseVector(1);
                    // Or, create a RandomAccessSparseVector of size 2, as:
                    Vector vec = new RandomAccessSparseVector(rec.length);
                    
                    // Method ,assign(), applies the function to each element of the receiver
                    //   If RandomAccessSparseVector size is just 1, we may assign to it
                    //      either userid or movieid. For example: vec.assign(d[1]);
                    // Argument to assign() can only be a double array or a double value
                    //   or a vector but not integer or text.
                    vec.assign(d);      // Assign a double array to vector
                 
                    // Prepare for writing the vector in sequence file
                    //    Create an object of class VectorWritable and set its value
                    VectorWritable writable = new VectorWritable();
                    writable.set(vec);
                                        
                    // Check vector size
                    System.out.println("Vector size: "+ vec.size());
                    // Check vector value
                    System.out.println("Vector value: "+ vec.toString());
                    // Check what is actually being written to sequence file
                    System.out.println("Vector value being written: "+writable.toString());
                    System.out.println("Key value being written: "+d[0]);
                    
                    // Mahout sequence file for 'recommendfactorized' requires that key be of class IntWritable
                    //   and value which is ignored be of class VectorWritable.
                    // Append now line-input to sequence file in either way:
                    //   writer.append( new IntWritable(Integer.valueOf(rec[0])) , writable);
                    //    OR
                    writer.append( new IntWritable( (int) d[0]) , writable);
                    // Note: As value part of sequencefile is ignored, we could have written just
                    //       any arbitrary number to it instead of rec-array.
                    }
                writer.close();
                }
        }

This code produces a sequence file, /home/ashokharnal/part-0000, on the local file system. Copy this to hadoop, say to /user/ashokharnal/part-0000. You can dump the contents of this sequence file to a text file on your local file system using the following mahout command:

mahout seqdumper -i /user/ashokharnal/part-0000   -o /home/ashokharnal/dump.txt

File, dump.txt, has the following contents:

[ashokharnal@master ~]$ cat dump.txt
Input Path: /user/ashokharnal/part-0000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 1: Value: {1:6.0,0:1.0}
Key: 2: Value: {1:13.0,0:2.0}
Key: 3: Value: {1:245.0,0:3.0}
Key: 4: Value: {1:50.0,0:4.0}
Key: 46: Value: {1:682.0,0:46.0}
Count: 5

As mentioned above, the key is of class ‘IntWritable’ while the value part in curly brackets is of class ‘VectorWritable’. That the key be ‘IntWritable’ and the value be ‘VectorWritable’ is a requirement for our next step. We now use the ‘mahout recommendfactorized‘ command to make top-N recommendations for the specified users. The script is as follows:

# This script is in continuation of earlier bash script
#  Dollar constants used are as in the earlier bash script
mahout recommendfactorized \
       --input $ddir/part-0000  \
       --userFeatures $out_folder/U/ \
       --itemFeatures $out_folder/M/ \
       --numRecommendations 15 \
       --output /tmp/topNrecommendations \
       --maxRating 5
       
##Parameters
# input                   Sequence file with userids
# userFeatures            user-feature matrix
# itemFeatures            item-features matrix
# numRecommendations      top-N recommendations to make per user listed in the input file
# output                  top-N recommendations in descending order of importance
# maxRating               Maximum possible rating for any item

You can now open the text file under the hadoop folder ‘/tmp/topNrecommendations’. The top-N recommendations appear as follows:

1   [1449:5.0,119:4.8454704,169:4.837305,408:4.768415,474:4.710996,50:4.6945467,1142:4.692111,694:4.646718,127:4.614179,174:4.6061845,513:4.6046524,178:4.6008058,483:4.5722823,1122:4.5680165,12:4.5675783]
2	[1449:5.0,1512:4.9012074,1194:4.900322,1193:4.751229,242:4.7209125,178:4.717094,318:4.702426,661:4.700798,427:4.696252,302:4.6929398,357:4.6668787,1064:4.642539,603:4.6239715,98:4.5959983,694:4.5866184]
3	[902:5.0,320:4.716057,865:4.540695,1143:4.36101,340:4.310355,896:4.307236,179:4.195304,1368:4.194901,512:4.1928997,345:4.169366,1642:4.136083,445:4.094295,321:4.0908194,133:4.085527,423:4.06355]
4	[251:5.0,253:5.0,1368:5.0,190:5.0,1137:5.0,320:5.0,223:5.0,100:5.0,1449:5.0,1005:5.0,1396:5.0,10:5.0,1466:5.0,1099:5.0,1642:5.0]
46	[958:5.0,512:5.0,1449:5.0,1159:5.0,753:5.0,170:5.0,1642:5.0,408:5.0,1062:5.0,114:5.0,745:5.0,921:5.0,793:5.0,515:5.0,169:4.9880037]

In fact, if you are interested, you can build top-N recommendations for all users in one go using the user-ratings (sequence) file under the folder ‘userRatings’ (it contains the already known user ratings, as in file uexp.base). Recall that this folder was created by the earlier mahout script. The following is the mahout script.

mahout recommendfactorized \
       --input /user/ashokharnal/uexp.out/userRatings/  \
       --userFeatures $out_folder/U/ \
       --itemFeatures $out_folder/M/ \
       --numRecommendations 15 \
       --output /tmp/topNrecommendations \
       --maxRating 5

Incidentally, mahout does not necessarily need hadoop. It will work on the local file system even if you do not have hadoop installed. To make mahout work on the local file system and ignore hadoop altogether, set the variable MAHOUT_LOCAL to any value, as:

export MAHOUT_LOCAL="abc"
 |
 |
# To make it work on hadoop, unset the value
export MAHOUT_LOCAL=""

It is better to set the value in ~/.bashrc and then run: source ~/.bashrc

While we now know the top-N recommendations for a user, what if we want to find the rating for a movie not in the top-N? In the next blog we will cover this aspect, i.e. find the predicted user rating for any (userid, movieid).

That finishes this. Have a cup of coffee!

Matrix Factorization for Recommender engine–A simple explanation

December 28, 2014

Techniques of matrix factorization are used for discovering the preference of a customer regarding items for which he has not expressed any preference so far. Such discovery of preference helps e-commerce sites (or news portals) to recommend to customers items (or news) that they might consider purchasing (or reading). An e-commerce site hosts millions of items, and out of these a customer’s preference is known only for very few. Thus, if a user-item preference (or rating) matrix is constructed, most of the ratings in this matrix would be 0 (or NA, i.e. not available). Such a matrix is known as a sparse matrix. As most of the cells in this matrix are zero, many different storage techniques are adopted to save space. For example, if per row (i.e. per user) preference is expressed for, say, 100 items out of columns for a million items, then instead of storing a million integer values per row one could just record the row number, column number and value of these 100 preferences, as below; the rest are implied to be of zero value.

[row_no_1  {(col_no, col_value), (col_no,col_value)....}]
[row_no_2 {(col_no, col_value), (col_no,col_value)....}]
[row_no_n  {(col_no, col_value), (col_no,col_value)....}]

In this way, for 100 preferences per row, one may have to store not more than about 420 integers/characters (including brackets etc.) instead of a million mostly-zero integers. The matrix now not only occupies less space, but possibly the whole of it can be brought into RAM for further processing. Many ingenious ways have been devised to store sparse matrices; Wikipedia is a good starting point for understanding them. Many of these storage schemes also have accompanying APIs to carry out common matrix operations directly on them, such as matrix multiplication, transpose etc. Thus, a number of matrix operations can be speeded up using sparse matrices.
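As a small illustration (not something the Mahout workflow above needs), the following Python sketch, assuming numpy and scipy are installed, stores a single user's 100 ratings over a million item columns as (row, column, value) triplets using scipy's COO sparse format:

# a minimal sketch of sparse storage: only the 100 non-zero ratings are kept
import numpy as np
from scipy.sparse import coo_matrix

n_items = 1000000                                            # one million item columns
rated_cols = np.random.choice(n_items, 100, replace=False)   # 100 rated items
ratings = np.random.randint(1, 6, size=100)                  # ratings between 1 and 5
rows = np.zeros(100, dtype=int)                              # all entries belong to row 0 (one user)

user_row = coo_matrix((ratings, (rows, rated_cols)), shape=(1, n_items))
print(user_row.nnz)            # 100 stored values instead of a million
user_row = user_row.tocsr()    # CSR supports fast row slicing and matrix products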

A sparse user-item rating matrix, V, can, in turn, be conceived of as the result of multiplying two matrices: a user-feature preference matrix and an item-feature matrix. In a user-feature matrix for mobile phones, for example, there may be such features as colour, weight etc. The same features appear in the item-feature matrix. The item-feature matrix may indicate, for an item, either the presence or absence of a feature or its relative ‘level’ with respect to the value of the same feature in another similar item. Below, we have three tables: one, a user-feature matrix (4 X 4); second, an item-feature matrix (4 X 3); and third, the product of the two. No-Entry, Dabbang and Gunday are names of three Indian movies.

Tables: 1 to 3. No-Entry, Dabbang and Gunday are Hindi movie names

Table: 1 User-feature matrix
        comedy  horror  adult  general
user1      4       3      1       3
user2      5       3      3       2
user3      3       2      1       1
user4      2       1      2       2

Table: 2 Item-feature matrix
         No-Entry  Dabbang  Gunday
comedy       1        0       0
horror       0        1       1
adult        0        0       0
general      1        1       1

Table: 3 Product of the two: user-item matrix
        No-Entry  Dabbang  Gunday
user1       7        6       6
user2       7        5       5
user3       4        3       3
user4       4        3       3

You may recall that the matrix-multiplication of two matrices A and B (A.B) is done as follows:

The value in the first cell is calculated as the dot product of the first row of A with the first column of B; the value in the second cell to its right is the dot product of the first row of A with the second column of B, and so on. Thus, with A as the user-feature matrix (Table 1) and B as the item-feature matrix (Table 2), the first cell value is: 4 X 1 + 3 X 0 + 1 X 0 + 3 X 1 = 7 and the second cell value is: 4 X 0 + 3 X 1 + 1 X 0 + 3 X 1 = 6. Thus, each cell value is the dot product of the corresponding user-row with the item-column. In general, the product matrix is calculated through a set of equations, as:

a11*b11 + a12*b21 + a13*b31 + a14*b41 = 7
a11*b12 + a12*b22 + a13*b32 + a14*b42 = 6
...and so on, one equation for each cell of the product matrix

where a11, a12, a13 and a14 are the elements of the first row of matrix A, b11, b21, b31 and b41 are the elements of the first column of matrix B, and b12, b22, b32 and b42 are the elements of its second column. Please refer to Wikipedia for further details.
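To make the arithmetic concrete, here is a small numpy sketch (my own illustration, assuming numpy is available) that multiplies the user-feature matrix of Table 1 with the item-feature matrix of Table 2 and reproduces the user-item matrix of Table 3:

import numpy as np

# A: rows user1..user4, columns comedy, horror, adult, general (Table 1)
A = np.array([[4, 3, 1, 3],
              [5, 3, 3, 2],
              [3, 2, 1, 1],
              [2, 1, 2, 2]])

# B: rows comedy, horror, adult, general; columns No-Entry, Dabbang, Gunday (Table 2)
B = np.array([[1, 0, 0],
              [0, 1, 1],
              [0, 0, 0],
              [1, 1, 1]])

V = A.dot(B)               # user-item matrix of Table 3
print(V)
# [[7 6 6]
#  [7 5 5]
#  [4 3 3]
#  [4 3 3]]
print(A[0].dot(B[:, 0]))   # first cell: 4*1 + 3*0 + 1*0 + 3*1 = 7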

In reverse, given just the product (generally called matrix V), it might be possible to solve the above set of equations (twelve equations, one for each cell of the 4 X 3 product, in our case) for the values of a11, a12, b11, b12 and so on, i.e. to be able to find the two multiplicand matrices, A and B.

Given a user-item rating matrix, we presume that there exists a (latent, not disclosed to us) user-feature matrix and an item-feature matrix. We do not know what those latent features are or how many of them there are; all we know is that such features exist. But why are we interested in factorizing a sparse user-item rating matrix into a user-feature matrix and an item-feature matrix? Consider the following (not so) sparse initial user-item matrix. Our interest is to find a user's preference even for those items where it is not known (represented by 0).

Table: 4 Initial user-item rating matrix

        No-Entry  Dabbang  Gunday
user1       5        0       2
user2       4        0       0
user3       0        2       0
user4       7        0       4

Manually, by trial and error, we try to factorize it into two matrices. The following is the result (ignore the bit of sparseness in the factors). The user-feature matrix (U) is as below. We do not know the names of these latent features, therefore we just name them f1-f4.

Table: 5 Derived user-feature matrix

        f1  f2  f3  f4
user1    2   1   2   0
user2    1   1   1   1
user3    0   1   0   1
user4    2   1   2   2

And the item-feature matrix (M) is:

Table: 6 Derived item-feature matrix

      No-Entry  Dabbang  Gunday
f1        1        0       0
f2        1        1       0
f3        1        0       1
f4        1        1       1

If we multiply the two matrix-factors in Table 5 and Table 6, we get the following user-item matrix:

Table: 7 Derived user-item matrix

        No-Entry  Dabbang  Gunday
user1       5        1       2
user2       4        2       2
user3       2        2       1
user4       7        3       4

Compare it with Table 4. We are able to fill in the blanks and get to know the user's preference even for those cases where his preference was not earlier known. Note that in this particular case the starting point, the user-item matrix of Table 4, was not very sparse and there was enough information to factorize it exactly: the factors, on multiplication, yielded no error for the already known ratings. In the real world we are not so lucky. The initial user-item matrix is very, very sparse, and the effort is to factorize it so that the root-mean-square error (RMSE) over the initially known ratings is minimized. There are many ways to achieve this minimum, and hence we have a number of algorithms for matrix factorization.
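If you wish to verify this yourself, the short numpy sketch below (an illustration of my own, not part of any library) multiplies the factors of Tables 5 and 6 and computes the RMSE over only the known ratings of Table 4:

import numpy as np

# Table 4: initial user-item ratings (0 = unknown)
V = np.array([[5, 0, 2],
              [4, 0, 0],
              [0, 2, 0],
              [7, 0, 4]])

# Table 5: derived user-feature matrix U (users x f1..f4)
U = np.array([[2, 1, 2, 0],
              [1, 1, 1, 1],
              [0, 1, 0, 1],
              [2, 1, 2, 2]])

# Table 6: derived item-feature matrix M (f1..f4 x movies)
M = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]])

V_hat = U.dot(M)                  # Table 7: derived user-item matrix
known = V > 0                     # positions of already known ratings
rmse = np.sqrt(np.mean((V[known] - V_hat[known]) ** 2))
print(V_hat)
print(rmse)                       # 0.0 -- the known ratings are reproduced exactly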

But how did we know there would be four features? Well, we did not know. We could have guessed five features, and the result would have been as below (again, ignore the bit of sparseness in the factors). Note that the derived user-item ratings matrix is now different from the one in Table 7 above, even though the RMSE over the known ratings is still zero.

Tables: 8-10: Factorization with five features

Table: 8 Derived item-feature matrix
      No Entry  Dabbang  Gunday
f1        1        0       0
f2        1        1       0
f3        1        0       1
f4        1        1       1
f5        1        0       1

Table: 9 Derived user-feature matrix
        f1  f2  f3  f4  f5
user1    2   1   1   0   1
user2    1   1   1   1   0
user3    1   1   1   1   1
user4    2   1   2   2   0

Table: 10 Derived user-item matrix
        No Entry  Dabbang  Gunday
user1       5        1       2
user2       4        2       2
user3       5        2       3
user4       7        3       4

So how many features do we select? In practice, the available rating data is divided into two parts: training data and test data. The training data is used to build a number of factorizations with varying numbers of features. For each of them, the RMSE is calculated over the user-item ratings held out in the test data, and the model with the minimum test RMSE is selected.
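The sketch below illustrates this selection procedure on the toy matrix of Table 4. The factorize helper is a deliberately simple gradient-descent factorizer of my own (not Mahout's algorithm), fitting only the known cells; in practice the matrix would be large and one would use a library implementation:

import numpy as np

def factorize(R, mask, k, steps=5000, lr=0.01, reg=0.02, seed=0):
    # fit W (users x k) and H (k x items) to only the known (mask == True) cells of R
    rng = np.random.default_rng(seed)
    m, n = R.shape
    W = 0.1 * rng.random((m, k))
    H = 0.1 * rng.random((k, n))
    users, items = np.where(mask)
    for _ in range(steps):
        for u, i in zip(users, items):
            err = R[u, i] - W[u].dot(H[:, i])
            wu = W[u].copy()
            W[u] += lr * (err * H[:, i] - reg * W[u])
            H[:, i] += lr * (err * wu - reg * H[:, i])
    return W, H

def rmse(R, mask, W, H):
    pred = W.dot(H)
    return np.sqrt(np.mean((R[mask] - pred[mask]) ** 2))

# Table 4 again as a toy example; in practice R is the full user-item matrix
R = np.array([[5., 0., 2.], [4., 0., 0.], [0., 2., 0.], [7., 0., 4.]])
known = R > 0

# hold out one known rating as 'test data'
test = np.zeros_like(known)
test[3, 2] = True
train = known & ~test

for k in (1, 2, 3):
    W, H = factorize(R, train, k)
    print(k, rmse(R, train, W, H), rmse(R, test, W, H))
# choose the k for which the test RMSE is lowest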

The initial sparse user-item matrix in Table 4 can also be factored in the following way (again ignore a bit of sparseness in factors):

Tables: 11-13: Another set of factors but with negative elements

Table: 11 Derived item-feature matrix
      No Entry  Dabbang  Gunday
f1        1        0       0
f2        1        1       0
f3        1        0       1
f4        1        1       1
f5        2        0       1

Table: 12 Derived user-feature matrix
        f1  f2  f3  f4  f5
user1    2  -1   1  -1   2
user2    1   1   2   2  -1
user3    1  -2   0   4   1
user4    2   1   2   2   0

Table: 13 Derived user-item matrix
        No Entry  Dabbang  Gunday
user1       5       -2       2
user2       4        3       3
user3       5        2       5
user4       7        3       4

But this time we have negative preferences in the user-feature matrix. The sparser the matrix, the more possible factorizations there are. Generally we go for Non-negative Matrix Factorization (NMF) rather than for factors with negative elements in them, because factors with negative elements are difficult to interpret. Also, numerical algorithms rather than equation-solving are used for matrix factorization; hence the factorization is generally approximate rather than exact, with non-negativity imposing one more constraint on the factorization.
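As a rough illustration of NMF, the snippet below uses scikit-learn's NMF class on the Table 4 matrix. One caveat: sklearn's NMF treats the zeros as actual zero ratings rather than as missing values, so this is only meant to show that both factors come out non-negative and that the reconstruction is approximate:

import numpy as np
from sklearn.decomposition import NMF

V = np.array([[5., 0., 2.], [4., 0., 0.], [0., 2., 0.], [7., 0., 4.]])

model = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
W = model.fit_transform(V)      # user-feature factor, all entries >= 0
H = model.components_           # feature-item factor, all entries >= 0
print(np.round(W.dot(H), 2))    # approximate, not exact, reconstruction of V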

But why at all do we interpret the factors of the user-item matrix as a user-feature matrix and an item-feature matrix? Maybe what we call the user-feature matrix is, in fact, a user-user (some-relationship) matrix, and so also for the other factor! We will explain this intuitively. Let us assume that they really are feature matrices, and we will show that this assumption does help us understand the calculation of a user's rating by matrix multiplication. If V is the m X n user-item matrix, W is an m X k user-feature matrix and M is an n X k item-feature matrix, then:

V = W . transpose(M)
or, V = W . H
where H = transpose(M)

k is the number of features. Generally k is much less than either m, the number of users, or n, the number of items. For example, for a 943 X 1063 user-item matrix, we may have two non-negative factors of size 943 X 15 and 1063 X 15. The original matrix had 943 * 1063 = 1002409 elements, while the factors have in all: 943*15 + 1063*15 = 30090 elements; this is about 33 times fewer than the original matrix. Factors, therefore, compress information.

Further, every user's preference for each feature is now represented row-wise in matrix W. On the other hand, how much of each feature is contained in a particular item is represented column-wise in H. To determine a user's rating for an item, we multiply that user's row of W with that item's column of H; in other words, the user's preference for each feature is weighted by how much of that feature is contained in the item, and the result is summed over all features. It is this intuitive interpretation that makes it useful to treat the two factors of V, namely W and H, as the user-(hidden)feature matrix and the (hidden)feature-item matrix rather than as something else. W is appropriately called the ‘basis’ matrix while H is called the ‘coefficient’ matrix. (As an aside, note that how much of a feature is contained in an item should be positive, not negative, and hence non-negative matrix factorization.) The column:

Vi = W.hi

is a one-column list of the respective ratings of all users for the one item ‘i’ represented by the (coefficient) column hi of H.
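In numpy terms (a toy sketch with made-up W and H), a single user's rating for item i is the dot product of that user's row of W with the column hi of H, and W.dot(H[:, i]) gives the entire column Vi in one go:

import numpy as np

# toy factors: 4 users x 2 hidden features, 2 hidden features x 3 items
W = np.array([[1.0, 2.0],
              [0.5, 1.0],
              [2.0, 0.0],
              [1.5, 1.5]])
H = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 2.0]])

i = 2                       # pick one item (third column of H)
print(W[0].dot(H[:, i]))    # rating of user 1 for item i: 1.0*0.0 + 2.0*2.0 = 4.0
print(W.dot(H[:, i]))       # Vi: ratings of all users for item i in one column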

In the next blog I will explain how to factorize a user-item rating matrix using the Alternating Least Squares (ALS) method. We will use Mahout for the purpose.