
Experimenting with Random Projections—A Non-maths approach

April 1, 2015

Random Projections is a dimensionality reduction technique that is very simple to implement, yet its mathematics is quite tough to grasp. The Johnson-Lindenstrauss lemma, which forms the basis of the technique, is expressed in mathematical symbols that are hard to comprehend even for a person with a fairly deep background in mathematics. In this blog, I will experiment with the results that follow from the lemma. The results themselves are easy to understand and to experiment with, and such experimentation provides a good deal of insight into what happens behind the scenes and how the dimensionality reduction works.
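
For reference, and without any proof, the result we lean on can be stated in one line. Given r points in n-dimensional space and a tolerance 0 < e < 1, there exists a linear map f into k dimensions, with k of the order of log(r)/e^2, such that for every pair of points u and v:

(1 - e) * d(u,v)^2 <= d(f(u),f(v))^2 <= (1 + e) * d(u,v)^2

That is, all pairwise Euclidean distances survive the projection to within a factor of (1 +/- e), and the k needed grows with the number of points rather than with the original dimension n.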

Dimensionality reduction is achieved in this manner: for a feature matrix A of dimensions r X n, a projection matrix P of dimensions n X k is created (where k << n). Matrix multiplication of A with P yields the transformed feature matrix S, of dimensions r X k. Note that A has r rows, one per object, and hence represents r points in n-dimensional space; S holds the same r objects but in k-dimensional space. The key property is that the relative Euclidean distances between the points of A are (approximately) maintained between the corresponding points of S. In the discussion that follows, we call P the projection matrix.
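
Before turning to MNIST, it is easy to check this property on toy data. The following sketch (my own illustration, not part of the recipe above) projects 100 random points from 1000 dimensions down to 50 and compares all pairwise distances before and after projection; the ratios should cluster tightly around 1.

import numpy as np
from scipy.spatial.distance import pdist

np.random.seed(0)
r, n, k = 100, 1000, 50

A = np.random.rand(r, n)                  # r points in n-dimensional space
P = np.random.standard_normal((n, k))     # projection matrix with N(0,1) entries
S = A.dot(P) / np.sqrt(k)                 # scale by 1/sqrt(k) to keep distances comparable

ratios = pdist(S) / pdist(A)              # projected-to-original distance ratio, per pair
print(ratios.mean(), ratios.std())        # mean close to 1.0, with a small spread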

We will work with the MNIST handwritten-digits set. You can download it from here. Our tasks are: a) create a projection matrix for this data set, b) use the projection matrix to transform the training digit-dataset to lower dimensions, c) train a k-nn classifier on this reduced-dimension (training) digit-set, d) transform the test digit-dataset to lower dimensions using the same projection matrix, and e) make predictions and determine the prediction accuracy on the transformed test-set. We will do this for projection matrices constructed in three different manners, and also vary the number of projected dimensions.

Constructing projection matrices

There are three ways in which a projection matrix, P, of dimension n X k can be constructed. All three are independent of the data in the feature set (i.e. the MNIST data set in our case). The first method is to create an array of random numbers drawn from N(0,1) (i.e. the standard normal distribution with mean 0 and standard deviation 1). The number of elements in the array is n X k, and the array is then reshaped into a matrix of size n X k. The other two ways are given in the paper by Dimitris Achlioptas, 'Database-friendly random projections: Johnson-Lindenstrauss with binary coins'. In effect the paper says that the elements of the projection matrix may be filled with just two numbers, -1 and +1, each selected with equal probability 0.5. The third way is to fill up the projection matrix with one of three numbers, sqrt(3), 0 or -sqrt(3), with probabilities 1/6, 2/3 and 1/6 respectively. We are going to try all three methods. While we will use python to train the k-nn classifier on the MNIST dataset, we will not use any pre-built sklearn class to create our projection matrix; we will do the work from scratch (so to speak). For example, the following short snippet creates a projection matrix of dimension (11000, 45), filled with N(0,1) random numbers (the dimensions here are merely illustrative).

import numpy as np

# Draw 11000*45 values from N(0,1) and shape them into a (11000, 45) matrix
rn = np.random.standard_normal(11000*45)
rn = rn.reshape(11000, 45)
transformation_matrix = np.asmatrix(rn)
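
As an aside, scikit-learn does provide ready-made classes for this in sklearn.random_projection; we deliberately avoid them here, but for comparison a minimal sketch using them would be:

from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

# Gaussian projection, analogous to the N(0,1) matrix built above
grp = GaussianRandomProjection(n_components=45)
# Sparse projection, analogous to the Achlioptas constructions described above
srp = SparseRandomProjection(n_components=45)
# reduced = grp.fit_transform(X)    # X: feature matrix with one row per sample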
  

Experimenting with projection matrix filled with N(0,1) numbers

As random projections preserve the relative Euclidean distances between points, it is appropriate to use a classifier that bases its classification decisions on Euclidean distance as a proximity measure; hence our choice of the k-Nearest Neighbour classifier. The training data has 785 columns: one for the label and 784 for pixel intensity values on a gray scale (each handwritten image has 28 X 28 = 784 pixels). As is well known, there is a high degree of correlation between adjacent pixel intensities, and hence there is large scope for dimensionality reduction. We will vary the projected dimensions from 20 to 500 and compare the classification accuracy against that obtained when all 784 columns are used. The following python code does this; it is liberally commented.


import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.cross_validation import train_test_split   # moved to sklearn.model_selection in newer scikit-learn
import matplotlib.pyplot as plt
%pylab

# Change directory to data folder
%cd /home/ashokharnal/

# Upload MNIST database
# Header is at row 0
digits = pd.read_csv("/home/ashokharnal/train.csv", header=0, sep=',')	
# Know its dimensions
digits.shape
# Separate target and predictors
y=digits['label']	# Target
X=digits.iloc[:,1:]	# Predictors

# Convert X to float for scaling it (it is a requirement of scaling class)
Z=X.astype('float64')

# Pre-process data (pixel values). Scale it between 0 and 1
min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(Z)
# X_train is numpy array. Convert it to data frame (though this conversion is not needed)
F=pd.DataFrame(X_train)
# Reassign it column names
F.columns=X.columns
# Split predictors and targets in ratio of 70:30 
split = train_test_split(F,y, test_size = 0.3,random_state = 42)
# split returns four arrays
(train_X, test_X, train_y, test_y) = split

print("Building k-nn model with all columns")
clf = neighbors.KNeighborsClassifier()
clf.fit(train_X,train_y)
Z = clf.predict(test_X)
max_accuracy=clf.score(test_X,test_y)
print("Model max possible accuracy : "+ str(max_accuracy))

# Generate 'num' integer values between 20 and 500.
num=20
projected_dimensions = np.int32(np.linspace(20, 500, num))

# An empty list to store accuracy scores
accuracy_list = []
# Begin iteration per projection
for r_projections in projected_dimensions:
    print("Next iteration:-")
    print("    No of columns to be projected: "+str(r_projections))

    # Create an array of N(0,1) values
    rn=np.random.standard_normal(size=(digits.shape[1] - 1)*r_projections)
    rn=rn.reshape((digits.shape[1] - 1),r_projections)
    transformation_matrix=np.asmatrix(rn)
    data=np.asmatrix(train_X)

    # Multiply data matrix with projection matrix to reduce dimensions
    # X is the transformed matrix
    X = data * transformation_matrix
    print("    Training digit data transformed to dimension: " + str(X.shape))

    # Train a classifier on the transformed matrix
    print("    Creating model")
    model = neighbors.KNeighborsClassifier()
    model.fit(X,train_y)

    # Evaluate the model on the test set
    # First reduce dimensions of test data using the same transformation
    test_X=np.asmatrix(test_X)
    test = test_X * transformation_matrix

    # Make predictions and check the score
    acc=model.score(test,test_y)
    print("    Accuracy achieved: "+str(acc))
    accuracy_list.append(acc)
    print("------------")

# create graph of accuracies achieved
plt.figure()
plt.suptitle("Accuracy of Random Projections on MNIST handwritten digits")
plt.xlabel("No of projected dimensions")
plt.ylabel("Accuracy")
plt.xlim([2, 500])
plt.ylim([0, 1.0])
 
# plot the maximum achievable accuracy against random projection accuracies
plt.plot(projected_dimensions, [max_accuracy] * len(accuracy_list), color = "r")
plt.plot(projected_dimensions, accuracy_list)
plt.show()

Accuracy results can be seen in the following graph. Even for a dimension as low as 20, the prediction accuracy is remarkable.

Random projections as per Gaussian distribution. Red line is max possible accuracy with all 784 columns considered.

It can be observed that with dimensions a little above 200, as much accuracy is achieved as with all 784 dimensions. The accuracy recorded at each step is given below.


Projection matrix with N(0,1) numbers
=======================================

Model max possible accuracy : 0.965317460317

First iteration:-
No of columns  projected: 20
Training digit-data transformed to dimension: (29400, 20)
Accuracy achieved: 0.837063492063
------------
Next iteration:-
No of columns  projected: 45
Training digit-data transformed to dimension: (29400, 45)
Accuracy achieved: 0.923333333333
------------
Next iteration:-
No of columns  projected: 70
Training digit-data transformed to dimension: (29400, 70)
Accuracy achieved: 0.94380952381
------------
Next iteration:-
No of columns  projected: 95
Training digit-data transformed to dimension: (29400, 95)
Accuracy achieved: 0.952063492063
------------
Next iteration:-
No of columns  projected: 121
Training digit-data transformed to dimension: (29400, 121)
Accuracy achieved: 0.955238095238
------------
Next iteration:-
No of columns  projected: 146
Training digit-data transformed to dimension: (29400, 146)
Accuracy achieved: 0.955793650794
------------
Next iteration:-
No of columns  projected: 171
Training digit-data transformed to dimension: (29400, 171)
Accuracy achieved: 0.95626984127
------------
Next iteration:-
No of columns  projected: 196
Training digit-data transformed to dimension: (29400, 196)
Accuracy achieved: 0.956984126984
------------
Next iteration:-
No of columns  projected: 222
Training digit-data transformed to dimension: (29400, 222)
Accuracy achieved: 0.959285714286
------------
Next iteration:-
No of columns  projected: 247
Training digit-data transformed to dimension: (29400, 247)
Accuracy achieved: 0.961031746032
------------
Next iteration:-
No of columns  projected: 272
Training digit-data transformed to dimension: (29400, 272)
Accuracy achieved: 0.960476190476
------------
Next iteration:-
No of columns  projected: 297
Training digit-data transformed to dimension: (29400, 297)
Accuracy achieved: 0.960793650794
------------
Next iteration:-
No of columns  projected: 323
Training digit-data transformed to dimension: (29400, 323)
Accuracy achieved: 0.960952380952
------------
Next iteration:-
No of columns  projected: 348
Training digit-data transformed to dimension: (29400, 348)
Accuracy achieved: 0.960793650794
------------
Next iteration:-
No of columns  projected: 373
Training digit-data transformed to dimension: (29400, 373)
Accuracy achieved: 0.962380952381
------------
Next iteration:-
No of columns  projected: 398
Training digit-data transformed to dimension: (29400, 398)
Accuracy achieved: 0.960714285714
------------
Next iteration:-
No of columns  projected: 424
Training digit-data transformed to dimension: (29400, 424)
Accuracy achieved: 0.961825396825
------------
Next iteration:-
No of columns  projected: 449
Training digit-data transformed to dimension: (29400, 449)
Accuracy achieved: 0.961984126984
------------
Next iteration:-
No of columns  projected: 474
Training digit-data transformed to dimension: (29400, 474)
Accuracy achieved: 0.963015873016
------------
Next iteration:-
No of columns  projected: 500
Training digit-data transformed to dimension: (29400, 500)
Accuracy achieved: 0.963492063492
------------

Experimenting with projection matrix of binary numbers

We now fill the projection matrix with binary values, -1 and +1, each occurring with probability 0.5. The following few lines of python code create such a projection matrix. For an explanation of how this code works, you may refer here.


import numpy as np

# Sample 'size' values from 'values' with the given selection probabilities
def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.0, -1.0])
probabilities = np.array([0.5,0.5])
#  1.0 occurs with p=0.5 (1st value) and -1.0 occurs with p=0.5 (2nd value)
rn=weighted_values(values, probabilities, (11000*45))
rn=rn.reshape(11000,45)
transformation_matrix=np.asmatrix(rn)
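
Incidentally, on any reasonably recent numpy (1.7 or later, if I recall correctly) the same matrix can be drawn more directly with np.random.choice:

import numpy as np

# One-liner equivalent: sample -1.0/+1.0 with equal probability
rn = np.random.choice(np.array([1.0, -1.0]), size=11000*45, p=[0.5, 0.5])
transformation_matrix = np.asmatrix(rn.reshape(11000, 45))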
  

The k-nn training code, along the same lines as the earlier code, is below. I have removed most comments.

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.cross_validation import train_test_split   # moved to sklearn.model_selection in newer scikit-learn
import matplotlib.pyplot as plt
%pylab

%cd /home/ashokharnal/Documents/
digits = pd.read_csv("/home/ashokharnal/Documents/train.csv", header=0, sep=',')	

# Separate target and predictors
y=digits['label']	# Target
X=digits.iloc[:,1:]	# Predictors

# Convert X to float for scaling it (it is a requirement of scaling)
Z=X.astype('float64')

# Preprocess data. Scale it between 0 and 1
min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(Z)

# Convert it to data frame 
F=pd.DataFrame(X_train)
F.columns=X.columns
# Split predictors and targets in ratio of 70:30 
split = train_test_split(F,y, test_size = 0.3,random_state = 42)
(train_X, test_X, train_y, test_y) = split

print("Building k-nn model with all columns")
clf = neighbors.KNeighborsClassifier()
clf.fit(train_X,train_y)
Z = clf.predict(test_X)
max_accuracy=clf.score(test_X,test_y)
print("Model max possible accuracy : "+ str(max_accuracy))

# Function to generate discrete random variables with specified weights
def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.0, -1.0])
probabilities = np.array([0.5,0.5])

# Get 'num' integer values between 20 and 500.
num=20
projected_dimensions = np.int32(np.linspace(20, 500, num))
accuracy_list = []

for r_projections in projected_dimensions:
	print("Next iteration:-")
	print("    No of columns to be projected: "+str(r_projections))
	# Create an array of binary values
	rn=weighted_values(values, probabilities, (digits.shape[1] - 1)*r_projections)
	rn=rn.reshape((digits.shape[1] - 1),r_projections)
	transformation_matrix=np.asmatrix(rn)
	data=np.asmatrix(train_X)
	X = data * transformation_matrix
	print("    Training digit data transformed to dimension: " + str(X.shape))

	# Train a classifier on the random projection
	print("    Creating model")
	model = neighbors.KNeighborsClassifier()
	model.fit(X,train_y)
	
	test_X=np.asmatrix(test_X)
	test = test_X * transformation_matrix
	acc=model.score(test,test_y)
	print("    Accuracy achieved: "+str(acc) )
	accuracy_list.append(acc)
	print("------------")    

# Create accuracy graph
plt.figure()
plt.suptitle("Accuracy of Random Projections on MNIST handwritten digits")
plt.xlabel("No of projected dimensions")
plt.ylabel("Accuracy")
plt.xlim([2, 500])
plt.ylim([0, 1.0])
 
# plot the maximum achievable accuracy against random projection accuracies
plt.plot(projected_dimensions, [max_accuracy] * len(accuracy_list), color = "r")
plt.plot(projected_dimensions, accuracy_list)
plt.show()

The following graph displays the accuracy results.

Projection matrix with binary values. Red line is max possible accuracy with all 784 columns considered.

Again, it can be seen that even with such a simple projection matrix, around 200 dimensions give near the maximum possible accuracy. The accuracy achieved vs. the number of projected columns is given below. Note that just 45 dimensions already provide a prediction accuracy of around 93%.

Using +1 and -1 as matrix elements
---------------------------------

Model max possible accuracy : 0.965317460317

First iteration:-
No of columns  projected: 20
Training digit-data transformed to dimension: (29400, 20)
Accuracy achieved: 0.842698412698
------------
Next iteration:-
No of columns  projected: 45
Training digit-data transformed to dimension: (29400, 45)
Accuracy achieved: 0.928492063492
------------
Next iteration:-
No of columns  projected: 70
Training digit-data transformed to dimension: (29400, 70)
Accuracy achieved: 0.948253968254
------------
Next iteration:-
No of columns  projected: 95
Training digit-data transformed to dimension: (29400, 95)
Accuracy achieved: 0.949682539683
------------
Next iteration:-
No of columns  projected: 121
Training digit-data transformed to dimension: (29400, 121)
Accuracy achieved: 0.957857142857
------------
Next iteration:-
No of columns  projected: 146
Training digit-data transformed to dimension: (29400, 146)
Accuracy achieved: 0.956507936508
------------
Next iteration:-
No of columns  projected: 171
Training digit-data transformed to dimension: (29400, 171)
Accuracy achieved: 0.955714285714
------------
Next iteration:-
No of columns  projected: 196
Training digit-data transformed to dimension: (29400, 196)
Accuracy achieved: 0.959126984127
------------
Next iteration:-
No of columns  projected: 222
Training digit-data transformed to dimension: (29400, 222)
Accuracy achieved: 0.958095238095
------------
Next iteration:-
No of columns  projected: 247
Training digit-data transformed to dimension: (29400, 247)
Accuracy achieved: 0.960238095238
------------
Next iteration:-
No of columns  projected: 272
Training digit-data transformed to dimension: (29400, 272)
Accuracy achieved: 0.961031746032
------------
Next iteration:-
No of columns  projected: 297
Training digit-data transformed to dimension: (29400, 297)
Accuracy achieved: 0.960317460317
------------
Next iteration:-
No of columns  projected: 323
Training digit-data transformed to dimension: (29400, 323)
Accuracy achieved: 0.960476190476
------------
Next iteration:-
No of columns  projected: 348
Training digit-data transformed to dimension: (29400, 348)
Accuracy achieved: 0.960873015873
------------
Next iteration:-
No of columns  projected: 373
Training digit-data transformed to dimension: (29400, 373)
Accuracy achieved: 0.961666666667
------------
Next iteration:-
No of columns  projected: 398
Training digit-data transformed to dimension: (29400, 398)
Accuracy achieved: 0.96253968254
------------
Next iteration:-
No of columns  projected: 424
Training digit-data transformed to dimension: (29400, 424)
Accuracy achieved: 0.96126984127
------------
Next iteration:-
No of columns  projected: 449
Training digit-data transformed to dimension: (29400, 449)
Accuracy achieved: 0.960238095238
------------
Next iteration:-
No of columns  projected: 474
Training digit-data transformed to dimension: (29400, 474)
Accuracy achieved: 0.962222222222
------------
Next iteration:-
No of columns  projected: 500
Training digit-data transformed to dimension: (29400, 500)
Accuracy achieved: 0.962619047619
------------

Experimenting with a projection matrix from another probability distribution

We will now construct the projection matrix using another simple probability distribution. This distribution has three possible values: sqrt(3), 0 and -sqrt(3), i.e. 1.7320508075, 0.0 and -1.7320508075, with probabilities 1/6, 2/3 and 1/6 respectively. As before, the following python code will populate a matrix with this distribution:

import numpy as np

# Sample 'size' values from 'values' with the given selection probabilities
def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.7320508075, 0.0, -1.7320508075])
probabilities = np.array([1.0/6.0, 2.0/3.0, 1.0/6.0])
rn=weighted_values(values, probabilities, (11000*45))
rn=rn.reshape(11000,45)
transformation_matrix=np.asmatrix(rn)
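
One attraction of this three-valued distribution (the 'database-friendly' part of Achlioptas' title) is that roughly two-thirds of the entries are zero, so the projection matrix can be held in a sparse format and the multiplication made correspondingly cheaper. A rough sketch, assuming scipy is available and reusing weighted_values from above:

from scipy import sparse

# Store the three-valued projection matrix sparsely; about 2/3 of its entries are zero
rn = weighted_values(values, probabilities, 11000*45)
P_sparse = sparse.csr_matrix(rn.reshape(11000, 45))
print(P_sparse.nnz / float(11000*45))    # fraction of non-zero entries, roughly 1/3

# To project, multiply as sparse and densify the (much smaller) result:
# reduced = sparse.csr_matrix(data).dot(P_sparse).toarray()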
  

The python code for K-Nearest Neighbour modelling using this transformation matrix to reduce dimensions is given below. We have removed all comments.


import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.cross_validation import train_test_split   # moved to sklearn.model_selection in newer scikit-learn
import matplotlib.pyplot as plt
%pylab

%cd /home/ashokharnal/Documents/
digits = pd.read_csv("/home/ashokharnal/Documents/train.csv", header=0, sep=',')	
y=digits['label']	# Target
X=digits.iloc[:,1:]	# Predictors
Z=X.astype('float64')

min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(Z)
F=pd.DataFrame(X_train)
F.columns=X.columns
split = train_test_split(F,y, test_size = 0.3,random_state = 42)
(train_X, test_X, train_y, test_y) = split

print("Building k-nn model with all columns")
clf = neighbors.KNeighborsClassifier()
clf.fit(train_X,train_y)
Z = clf.predict(test_X)
max_accuracy=clf.score(test_X,test_y)

print("Model max possible accuracy : "+ str(max_accuracy))

def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.7320508075, 0.0, -1.7320508075])
probabilities = np.array([1.0/6.0, 2.0/3.0, 1.0/6.0])

num=20
projected_dimensions = np.int32(np.linspace(20, 500, num))
accuracy_list = []

for r_projections in projected_dimensions:
	print("Next iteration:-")
	print("   No of columns to be projected: "+str(r_projections))
	# Create an array of values from the three-valued distribution
	rn=weighted_values(values, probabilities, (digits.shape[1] - 1)*r_projections)
	rn=rn.reshape((digits.shape[1] - 1),r_projections)
	transformation_matrix=np.asmatrix(rn)
	data=np.asmatrix(train_X)
	X = data * transformation_matrix
	print("   Training digit-data transformed to dimension: " + str(X.shape))

	# train a classifier on the random projection
	print("   Creating model")
	model = neighbors.KNeighborsClassifier()
	model.fit(X,train_y)
	# evaluate the model and update accuracy_list
	test_X=np.asmatrix(test_X)
	test = test_X * transformation_matrix
	acc=model.score(test,test_y)
	print("   Accuracy achieved: "+str(acc) )
	accuracy_list.append(acc)
	print("------------")    

plt.figure()
plt.suptitle("Accuracy of Random Projections on MNIST handwritten digits")
plt.xlabel("No of projected dimensions")
plt.ylabel("Accuracy")
plt.xlim([2, 500])
plt.ylim([0, 1.0])
 
plt.plot(projected_dimensions, [max_accuracy] * len(accuracy_list), color = "r")
plt.plot(projected_dimensions, accuracy_list)
plt.show()

The graph of accuracy score vs. number of projected columns is below.

Projection matrix with three possible values. Red line is max possible accuracy with all 784 columns considered.

Again, as before, accuracy increases with the number of projected columns and reaches the maximum achievable when the number of columns is a little more than 200. The accuracy achieved vs. projected columns in this case is given below. As before, even with dimensions as low as 45, the prediction accuracy is around 93%.


Using sqrt(3), 0 and -sqrt(3) as matrix elements
---------------------------------
First iteration:-
No of columns  projected: 20
Training digit-data transformed to dimension: (29400, 20)
Accuracy achieved: 0.829523809524
------------
Next iteration:-
No of columns  projected: 45
Training digit-data transformed to dimension: (29400, 45)
Accuracy achieved: 0.928015873016
------------
Next iteration:-
No of columns  projected: 70
Training digit-data transformed to dimension: (29400, 70)
Accuracy achieved: 0.94253968254
------------
Next iteration:-
No of columns  projected: 95
Training digit-data transformed to dimension: (29400, 95)
Accuracy achieved: 0.950317460317
------------
Next iteration:-
No of columns  projected: 121
Training digit-data transformed to dimension: (29400, 121)
Accuracy achieved: 0.951825396825
------------
Next iteration:-
No of columns  projected: 146
Training digit-data transformed to dimension: (29400, 146)
Accuracy achieved: 0.954285714286
------------
Next iteration:-
No of columns  projected: 171
Training digit-data transformed to dimension: (29400, 171)
Accuracy achieved: 0.959206349206
------------
Next iteration:-
No of columns  projected: 196
Training digit-data transformed to dimension: (29400, 196)
Accuracy achieved: 0.958095238095
------------
Next iteration:-
No of columns  projected: 222
Training digit-data transformed to dimension: (29400, 222)
Accuracy achieved: 0.958650793651
------------
Next iteration:-
No of columns  projected: 247
Training digit-data transformed to dimension: (29400, 247)
Accuracy achieved: 0.960634920635
------------
Next iteration:-
No of columns  projected: 272
Training digit-data transformed to dimension: (29400, 272)
Accuracy achieved: 0.96
------------
Next iteration:-
No of columns  projected: 297
Training digit-data transformed to dimension: (29400, 297)
Accuracy achieved: 0.960158730159
------------
Next iteration:-
No of columns  projected: 323
Training digit-data transformed to dimension: (29400, 323)
Accuracy achieved: 0.959603174603
------------
Next iteration:-
No of columns  projected: 348
Training digit-data transformed to dimension: (29400, 348)
Accuracy achieved: 0.961507936508
------------
Next iteration:-
No of columns  projected: 373
Training digit-data transformed to dimension: (29400, 373)
Accuracy achieved: 0.962222222222
------------
Next iteration:-
No of columns  projected: 398
Training digit-data transformed to dimension: (29400, 398)
Accuracy achieved: 0.963253968254
------------
Next iteration:-
No of columns  projected: 424
Training digit-data transformed to dimension: (29400, 424)
Accuracy achieved: 0.961507936508
------------
Next iteration:-
No of columns  projected: 449
Training digit-data transformed to dimension: (29400, 449)
Accuracy achieved: 0.963888888889
------------
Next iteration:-
No of columns  projected: 474
Training digit-data transformed to dimension: (29400, 474)
Accuracy achieved: 0.963095238095
------------
Next iteration:-
No of columns  projected: 500
Training digit-data transformed to dimension: (29400, 500)
Accuracy achieved: 0.963888888889
------------

This concludes our experiments with random projections.
