Random Projections is a dimensionality reduction technique that is very simple to implement, yet its mathematics is surprisingly hard. The Johnson-Lindenstrauss lemma that forms the basis of the technique is stated in symbols that can be difficult to follow even for someone with a fairly deep background in mathematics. In this blog, I will experiment with the results that follow from the lemma. The results themselves are easy to understand and to experiment with, and such experimentation provides a good deal of insight into what happens behind the scenes and how the dimensionality reduction works.

Dimensionality reduction is achieved in this manner: for a feature matrix A of dimensions r X n, a projection matrix P of dimensions n X k is created (where k << n). Matrix multiplication of A with P yields the transformed feature matrix S of dimensions r X k. Note that A has r rows, one per object, and hence represents r points in n-dimensional space; S contains the same r objects but in k-dimensional space. While the feature matrix is reduced from n to k dimensions, the relative Euclidean distances between the points in n-dimensional space are (approximately) preserved in the projected k dimensions. **In the discussion that follows, we call P the projection matrix.**
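This can be sketched on small synthetic data (not the MNIST set). The 1/sqrt(k) scaling below is my addition, commonly used to keep distances on the same scale; the code later in this post omits it, which scales all distances by a common factor and therefore does not affect relative distances:

```python
import numpy as np

rng = np.random.RandomState(42)
r, n, k = 5, 1000, 300                         # r points in n dimensions, projected down to k
A = rng.rand(r, n)                             # feature matrix, r x n
P = rng.standard_normal((n, k)) / np.sqrt(k)   # projection matrix, n x k, scaled by 1/sqrt(k)
S = A.dot(P)                                   # transformed feature matrix, r x k

# Compare pairwise Euclidean distances before and after projection
def pairwise(M):
    return np.array([np.linalg.norm(M[i] - M[j])
                     for i in range(len(M)) for j in range(i + 1, len(M))])

ratios = pairwise(S) / pairwise(A)
print(ratios)   # each ratio stays close to 1
```

Each of the ten pairwise distance ratios comes out close to 1, which is the lemma's promise in miniature.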

We will work with the MNIST handwritten digits set. You can download it from here. Our tasks are: a) create a projection matrix for this data set, b) use the projection matrix to transform the training digit-dataset to lower dimensions, c) train a k-nn classifier on this reduced-dimension training set, d) transform the test digit-dataset to lower dimensions using the same projection matrix, and e) make predictions and determine the prediction accuracy on the transformed test set. We will do this for projection matrices constructed in three different ways, and also vary the number of projected dimensions.

**Constructing Projection matrices**

There are three ways a projection matrix, P, of dimension n X k can be constructed. All three are independent of the data in the feature set (i.e. the MNIST data set in our case). The first method is to create an array of random numbers with an N(0,1) distribution (i.e. standard normal distribution with mean 0 and standard deviation 1). The number of elements in the array is n X k, and the array is then reshaped into a matrix of size n X k. The other two ways come from the paper by Dimitris Achlioptas, ‘Database-friendly random projections: Johnson-Lindenstrauss with binary coins‘. In effect, the paper says that the elements of the projection matrix may be filled with just two numbers, -1 and +1, each selected with equal probability 0.5. The third way is to fill the projection matrix with one of three numbers, sqrt(3), 0 or -sqrt(3), with probabilities 1/6, 2/3 and 1/6 respectively. We are going to try all three methods. While we will use python to train a k-nn classifier on the MNIST dataset, we will not use any pre-built sklearn class to create the projection matrix; we will do that work from scratch (so to say). For example, the following three-line code creates a projection matrix of dimension (11000, 45), filled with N(0,1) random numbers.

```python
rn = np.random.standard_normal(11000 * 45)
rn = rn.reshape(11000, 45)
transformation_matrix = np.asmatrix(rn)
```
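Why does such a matrix preserve distances at all? The columns of a large random N(0,1) matrix are nearly orthogonal to one another, so the projection behaves almost like a change to an orthogonal basis. A small check of this (my own addition, not part of the recipe above):

```python
import numpy as np

rng = np.random.RandomState(0)
P = rng.standard_normal((11000, 45))
# Normalize each column to unit length
P = P / np.linalg.norm(P, axis=0)
# Entries of P'P are cosines between column pairs; off-diagonal ones should be tiny
G = P.T.dot(P)
off_diag = G[~np.eye(45, dtype=bool)]
print(np.abs(off_diag).max())   # small: columns are nearly orthogonal
```

With 45 columns in an 11000-dimensional space, the largest cosine between any two columns comes out at only a few hundredths.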

**Experimenting with projection matrix filled with N(0,1) numbers**

As random projections preserve the relative Euclidean distances between points, it is appropriate to use a classifier that bases its decisions on Euclidean distance as a proximity measure; hence our choice of the K-Nearest Neighbour classifier. The training data has 785 columns: one for the label and 784 for pixel intensity values on a gray scale, since each handwritten image has 28 X 28 = 784 pixels. As is well known, there is a high degree of correlation between adjacent pixel intensities, and hence a large scope for dimensionality reduction. We will vary the number of projected dimensions from 20 to 500 and compare the classification accuracy against that obtained when all 784 columns are used for prediction. The following python code does this. The code is liberally commented.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.model_selection import train_test_split   # sklearn.cross_validation in older versions
import matplotlib.pyplot as plt
%pylab
# Change directory to data folder
%cd /home/ashokharnal/
# Load the MNIST data; header is at row 0
digits = pd.read_csv("/home/ashokharnal/train.csv", header=0, sep=',')
# Know its dimensions
digits.shape
# Separate target and predictors
y = digits['label']       # Target
X = digits.iloc[:, 1:]    # Predictors
# Convert X to float for scaling (a requirement of the scaling class)
Z = X.astype('float64')
# Pre-process data (pixel values): scale between 0 and 1
min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(Z)
# X_train is a numpy array; convert it to a data frame (not strictly needed)
F = pd.DataFrame(X_train)
# Reassign column names
F.columns = X.columns
# Split predictors and targets in the ratio 70:30
# train_test_split returns four arrays (train_X, test_X, train_y, test_y)
split = train_test_split(F, y, test_size=0.3, random_state=42)
(train_X, test_X, train_y, test_y) = split
print("Building k-nn model with all columns")
clf = neighbors.KNeighborsClassifier()
clf.fit(train_X, train_y)
Z = clf.predict(test_X)
max_accuracy = clf.score(test_X, test_y)
print("Model max possible accuracy : " + str(max_accuracy))
# Generate 'num' integer values between 20 and 500
num = 20
projected_dimensions = np.int32(np.linspace(20, 500, num))
# An empty list to store accuracy scores
accuracy_list = []
# Begin iteration per projection
for r_projections in projected_dimensions:
    print("Next iteration:-")
    print("    No of columns to be projected: " + str(r_projections))
    # Create an array of N(0,1) values
    rn = np.random.standard_normal(size=(digits.shape[1] - 1) * r_projections)
    rn = rn.reshape((digits.shape[1] - 1), r_projections)
    transformation_matrix = np.asmatrix(rn)
    data = np.asmatrix(train_X)
    # Multiply the data matrix with the projection matrix to reduce dimensions
    # X is the transformed matrix
    X = data * transformation_matrix
    print("    Training digit data transformed to dimension: " + str(X.shape))
    # Train a classifier on the transformed matrix
    print("    Creating model")
    model = neighbors.KNeighborsClassifier()
    model.fit(X, train_y)
    # Evaluate the model on the test set:
    # first reduce dimensions of the test data using the same transformation
    test_X = np.asmatrix(test_X)
    test = test_X * transformation_matrix
    # Make predictions and check the score
    acc = model.score(test, test_y)
    print("    Accuracy achieved: " + str(acc))
    accuracy_list.append(acc)
    print("------------")
# Create a graph of the accuracies achieved
plt.figure()
plt.suptitle("Accuracy of Random Projections on MNIST handwritten digits")
plt.xlabel("No of projected dimensions")
plt.ylabel("Accuracy")
plt.xlim([2, 500])
plt.ylim([0, 1.0])
# Plot the maximum achievable accuracy against the random projection accuracies
plt.plot(projected_dimensions, [max_accuracy] * len(accuracy_list), color="r")
plt.plot(projected_dimensions, accuracy_list)
plt.show()
```
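For comparison, scikit-learn ships a ready-made class that performs the same N(0,1) projection (with an added 1/sqrt(k) scaling that leaves relative distances unchanged). A minimal sketch on synthetic data, since loading MNIST is omitted here:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(42)
data = rng.rand(100, 784)        # stand-in for the scaled pixel matrix

# Project 784 columns down to 45 using a Gaussian random matrix
grp = GaussianRandomProjection(n_components=45, random_state=42)
reduced = grp.fit_transform(data)
print(reduced.shape)             # (100, 45)
```

The fitted `grp` object can then transform the test set with the same matrix, exactly as the hand-rolled code above reuses `transformation_matrix`.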

Accuracy results can be seen in the following graph. Even for a dimension as low as 20, the prediction accuracy is remarkable.

It can be observed that with dimensions a little higher than 200, as much accuracy is achieved as with all 784 dimensions. The accuracies recorded are given below.

Projection matrix with N(0,1) numbers
=======================================

Model max possible accuracy: 0.965317460317. In each iteration the training digit-data (29400 rows) was transformed to dimension (29400, k), where k is the number of projected columns:

| Projected columns (k) | Accuracy achieved |
|---|---|
| 20 | 0.837063492063 |
| 45 | 0.923333333333 |
| 70 | 0.94380952381 |
| 95 | 0.952063492063 |
| 121 | 0.955238095238 |
| 146 | 0.955793650794 |
| 171 | 0.95626984127 |
| 196 | 0.956984126984 |
| 222 | 0.959285714286 |
| 247 | 0.961031746032 |
| 272 | 0.960476190476 |
| 297 | 0.960793650794 |
| 323 | 0.960952380952 |
| 348 | 0.960793650794 |
| 373 | 0.962380952381 |
| 398 | 0.960714285714 |
| 424 | 0.961825396825 |
| 449 | 0.961984126984 |
| 474 | 0.963015873016 |
| 500 | 0.963492063492 |

**Experimenting with projection matrix of binary numbers**

We now fill the projection matrix with binary values, -1 and +1, each with probability 0.5. The following few lines of python code create such a projection matrix. For an explanation of how this code works, please refer here.

```python
def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.0, -1.0])
probabilities = np.array([0.5, 0.5])
# 1.0 occurs with p=0.5 (1st value) and -1.0 occurs with p=0.5 (2nd value)
rn = weighted_values(values, probabilities, 11000 * 45)
rn = rn.reshape(11000, 45)
transformation_matrix = np.asmatrix(rn)
```
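As an aside, NumPy (since version 1.7) can draw the same matrix more directly with `np.random.choice`, which accepts a probability vector; a sketch:

```python
import numpy as np

values = np.array([1.0, -1.0])
probabilities = np.array([0.5, 0.5])
# Draw 11000*45 values of +1/-1, each with probability 0.5
rn = np.random.choice(values, size=11000 * 45, p=probabilities)
transformation_matrix = np.asmatrix(rn.reshape(11000, 45))
print(transformation_matrix.shape)   # (11000, 45)
```

The `weighted_values` helper above remains useful on older NumPy versions, which is presumably why the post rolls its own.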

The k-nn training code, along the same lines as the earlier code, is below. I have removed most comments.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.model_selection import train_test_split   # sklearn.cross_validation in older versions
import matplotlib.pyplot as plt
%pylab
%cd /home/ashokharnal/Documents/
digits = pd.read_csv("/home/ashokharnal/Documents/train.csv", header=0, sep=',')
# Separate target and predictors
y = digits['label']       # Target
X = digits.iloc[:, 1:]    # Predictors
# Convert X to float for scaling (a requirement of scaling)
Z = X.astype('float64')
# Preprocess data: scale between 0 and 1
min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(Z)
# Convert to data frame
F = pd.DataFrame(X_train)
F.columns = X.columns
# Split predictors and targets in the ratio 70:30
split = train_test_split(F, y, test_size=0.3, random_state=42)
(train_X, test_X, train_y, test_y) = split
print("Building k-nn model with all columns")
clf = neighbors.KNeighborsClassifier()
clf.fit(train_X, train_y)
Z = clf.predict(test_X)
max_accuracy = clf.score(test_X, test_y)
print("Model max possible accuracy : " + str(max_accuracy))

# Function to generate discrete random variables with specified weights
def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.0, -1.0])
probabilities = np.array([0.5, 0.5])
# Get 'num' integer values between 20 and 500
num = 20
projected_dimensions = np.int32(np.linspace(20, 500, num))
accuracy_list = []
for r_projections in projected_dimensions:
    print("Next iteration:-")
    print("    No of columns to be projected: " + str(r_projections))
    # Create an array of binary values
    rn = weighted_values(values, probabilities, (digits.shape[1] - 1) * r_projections)
    rn = rn.reshape((digits.shape[1] - 1), r_projections)
    transformation_matrix = np.asmatrix(rn)
    data = np.asmatrix(train_X)
    X = data * transformation_matrix
    print("    Training digit data transformed to dimension: " + str(X.shape))
    # Train a classifier on the random projection
    print("    Creating model")
    model = neighbors.KNeighborsClassifier()
    model.fit(X, train_y)
    test_X = np.asmatrix(test_X)
    test = test_X * transformation_matrix
    acc = model.score(test, test_y)
    print("    Accuracy achieved: " + str(acc))
    accuracy_list.append(acc)
    print("------------")
# Create accuracy graph
plt.figure()
plt.suptitle("Accuracy of Random Projections on MNIST handwritten digits")
plt.xlabel("No of projected dimensions")
plt.ylabel("Accuracy")
plt.xlim([2, 500])
plt.ylim([0, 1.0])
# Plot the maximum achievable accuracy against the random projection accuracies
plt.plot(projected_dimensions, [max_accuracy] * len(accuracy_list), color="r")
plt.plot(projected_dimensions, accuracy_list)
plt.show()
```
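Here too scikit-learn has a corresponding class: `SparseRandomProjection` with `density=1` draws exactly this ±1 matrix (scaled by 1/sqrt(k), which rescales all distances equally and so does not change k-NN decisions). A sketch on synthetic data:

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(42)
data = rng.rand(100, 784)        # stand-in for the scaled pixel matrix

# density=1 means no zero entries: every element is +1/sqrt(45) or -1/sqrt(45)
srp = SparseRandomProjection(n_components=45, density=1.0, random_state=42)
reduced = srp.fit_transform(data)
print(reduced.shape)             # (100, 45)
```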

The following graph displays the accuracy results.

Again, it can be seen that even with such a simple projection matrix, around 200 dimensions give near the maximum possible accuracy. An account of accuracy achieved vs. projected columns is given below. Note that 45 dimensions already provide a prediction accuracy of around 93%.

Using +1 and -1 as matrix elements
---------------------------------

Model max possible accuracy: 0.965317460317. In each iteration the training digit-data was transformed to dimension (29400, k):

| Projected columns (k) | Accuracy achieved |
|---|---|
| 20 | 0.842698412698 |
| 45 | 0.928492063492 |
| 70 | 0.948253968254 |
| 95 | 0.949682539683 |
| 121 | 0.957857142857 |
| 146 | 0.956507936508 |
| 171 | 0.955714285714 |
| 196 | 0.959126984127 |
| 222 | 0.958095238095 |
| 247 | 0.960238095238 |
| 272 | 0.961031746032 |
| 297 | 0.960317460317 |
| 323 | 0.960476190476 |
| 348 | 0.960873015873 |
| 373 | 0.961666666667 |
| 398 | 0.96253968254 |
| 424 | 0.96126984127 |
| 449 | 0.960238095238 |
| 474 | 0.962222222222 |
| 500 | 0.962619047619 |

**Experimenting with projection matrix: another probability distribution**

We will now construct the projection matrix using another simple probability distribution. This distribution has three possible values, sqrt(3), 0 and -sqrt(3) (i.e. 1.7320508075, 0.0 and -1.7320508075), with probabilities 1/6, 2/3 and 1/6. As before, the following python code populates a matrix with this distribution:

```python
def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.7320508075, 0.0, -1.7320508075])
probabilities = np.array([1.0/6.0, 2.0/3.0, 1.0/6.0])
rn = weighted_values(values, probabilities, 11000 * 45)
rn = rn.reshape(11000, 45)
transformation_matrix = np.asmatrix(rn)
```
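Like the first two schemes, this three-valued distribution has mean 0 and variance 1 (E[X²] = 3·1/6 + 0·2/3 + 3·1/6 = 1), which is what makes it a drop-in replacement for N(0,1) despite two-thirds of its entries being zero. A quick empirical check (my own addition):

```python
import numpy as np

def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.7320508075, 0.0, -1.7320508075])
probabilities = np.array([1.0/6.0, 2.0/3.0, 1.0/6.0])
rn = weighted_values(values, probabilities, 11000 * 45)

print(round(rn.mean(), 2), round(rn.var(), 2))   # approximately 0.0 and 1.0
print(round((rn == 0.0).mean(), 2))              # approximately 2/3 of entries are zero
```

The zeros are also the scheme's practical attraction: two-thirds of the multiplications in the projection can simply be skipped.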

The python code for K-Nearest Neighbour modelling using this transformation matrix to reduce dimensions is given below. We have removed all comments.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.model_selection import train_test_split   # sklearn.cross_validation in older versions
import matplotlib.pyplot as plt
%pylab
%cd /home/ashokharnal/Documents/
digits = pd.read_csv("/home/ashokharnal/Documents/train.csv", header=0, sep=',')
y = digits['label']       # Target
X = digits.iloc[:, 1:]    # Predictors
Z = X.astype('float64')
min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(Z)
F = pd.DataFrame(X_train)
F.columns = X.columns
split = train_test_split(F, y, test_size=0.3, random_state=42)
(train_X, test_X, train_y, test_y) = split
print("Building k-nn model with all columns")
clf = neighbors.KNeighborsClassifier()
clf.fit(train_X, train_y)
Z = clf.predict(test_X)
max_accuracy = clf.score(test_X, test_y)
print("Model max possible accuracy : " + str(max_accuracy))

def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(np.random.random_sample(size), bins)]

values = np.array([1.7320508075, 0.0, -1.7320508075])
probabilities = np.array([1.0/6.0, 2.0/3.0, 1.0/6.0])
num = 20
projected_dimensions = np.int32(np.linspace(20, 500, num))
accuracy_list = []
for r_projections in projected_dimensions:
    print("Next iteration:-")
    print("    No of columns to be projected: " + str(r_projections))
    # Create an array of values from the three-valued distribution
    rn = weighted_values(values, probabilities, (digits.shape[1] - 1) * r_projections)
    rn = rn.reshape((digits.shape[1] - 1), r_projections)
    transformation_matrix = np.asmatrix(rn)
    data = np.asmatrix(train_X)
    X = data * transformation_matrix
    print("    Training digit-data transformed to dimension: " + str(X.shape))
    # Train a classifier on the random projection
    print("    Creating model")
    model = neighbors.KNeighborsClassifier()
    model.fit(X, train_y)
    # Evaluate the model and update accuracy_list
    test_X = np.asmatrix(test_X)
    test = test_X * transformation_matrix
    acc = model.score(test, test_y)
    print("    Accuracy achieved: " + str(acc))
    accuracy_list.append(acc)
    print("------------")
plt.figure()
plt.suptitle("Accuracy of Random Projections on MNIST handwritten digits")
plt.xlabel("No of projected dimensions")
plt.ylabel("Accuracy")
plt.xlim([2, 500])
plt.ylim([0, 1.0])
plt.plot(projected_dimensions, [max_accuracy] * len(accuracy_list), color="r")
plt.plot(projected_dimensions, accuracy_list)
plt.show()
```
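This sparse scheme, too, has a library counterpart: `SparseRandomProjection` with `density=1/3` draws ±sqrt(3) entries with probability 1/6 each and zeros with probability 2/3 (again with the usual 1/sqrt(k) scaling), and stores the matrix in sparse form so the skipped multiplications actually translate into speed. A sketch on synthetic data:

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(42)
data = rng.rand(100, 784)        # stand-in for the scaled pixel matrix

# density=1/3 reproduces the sqrt(3)/0/-sqrt(3) scheme (up to 1/sqrt(k) scaling)
srp = SparseRandomProjection(n_components=45, density=1.0/3.0, random_state=42)
reduced = srp.fit_transform(data)
print(reduced.shape)             # (100, 45)
```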

The graph of accuracy score vs. projected columns is below.

Again, as before, prediction accuracy increases with the number of columns and reaches the maximum achievable when the number of columns is a little more than 200. An account of accuracy achieved vs. projected columns in this case is given below. As before, even with dimensions as low as 45, the prediction accuracy is around 93%.

Using sqrt(3), 0 and -sqrt(3) as matrix elements
------------------------------------------------

In each iteration the training digit-data was transformed to dimension (29400, k):

| Projected columns (k) | Accuracy achieved |
|---|---|
| 20 | 0.829523809524 |
| 45 | 0.928015873016 |
| 70 | 0.94253968254 |
| 95 | 0.950317460317 |
| 121 | 0.951825396825 |
| 146 | 0.954285714286 |
| 171 | 0.959206349206 |
| 196 | 0.958095238095 |
| 222 | 0.958650793651 |
| 247 | 0.960634920635 |
| 272 | 0.96 |
| 297 | 0.960158730159 |
| 323 | 0.959603174603 |
| 348 | 0.961507936508 |
| 373 | 0.962222222222 |
| 398 | 0.963253968254 |
| 424 | 0.961507936508 |
| 449 | 0.963888888889 |
| 474 | 0.963095238095 |
| 500 | 0.963888888889 |

This finishes our experiments in random projections.
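To close the loop back to the Johnson-Lindenstrauss lemma itself: scikit-learn exposes the lemma's bound directly, giving the minimum number of dimensions guaranteed (with high probability) to preserve all pairwise distances within a factor of 1 ± eps. A sketch for our 29,400 training points:

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Minimum dimensions, per the JL bound, to preserve pairwise distances
# of 29400 points within a factor (1 +/- eps)
dims = {eps: johnson_lindenstrauss_min_dim(29400, eps=eps)
        for eps in (0.5, 0.3, 0.1)}
for eps, k in sorted(dims.items()):
    print("eps = %.1f -> k >= %d" % (eps, k))
```

The bound is worst-case and noticeably more conservative than what we observed empirically, where a little over 200 dimensions already matched the full 784-column accuracy.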