A number of open-source communities are using python to make available artificial intelligence and machine learning related packages and libraries. A list of communities and packages is available at PythonForArtificialIntelligence. In this blog I will use libraries from scikit-learn.
About sklearn, classes & datasets
Project scikit-learn is a Machine Learning Project in Python. It has a good collection of algorithms for some of the well known data-mining and data analysis jobs such as for Classification, Regression, Clustering, Dimensionality reduction and Model Selection. These algorithms are constructed on a stack of NumPy, SciPy library, and matplotlib. All these are part of SciPy.org ecosystem.
NumPy serves as a base package for scientific and mathematical computing. It has very powerful and efficient N-dimensional array manipulation capabilities. You can read it as Numeric arrays in Python. SciPy library is a collection of scientific and numerical computing tools. Numerical computing uses algorithmic and numerical approximation (rather than equation solving) approach for solving mathematical problems. For some of the complex mathematical problems, numerical solutions are easier, fast and accurate. For example, it is better to use an algorithmic approach of gradient descent to discover the minimum coordinates of a multidimensional surface rather than first mathematically describe the surface and then solve a complex set of equations to find its global or local minima. Matplotlib is a python 2-D plotting library for rendering high quality graphs.
Various python distributions are available that can install SciPy stack or SciPy.org ecosystem. A list is available here. One easy way to install the SciPy stack is through anaconda python distribution. Anaconda also installs IPython and pandas. Installation can be on Windows, Linux or Mac.
sklearn comes with a handy dataset module ‘sklearn.datasets‘. The module has few pre-installed small datasets and accompanying classes to load these datasets. It has classes to fetch some popular datasets from repositories such as mldata.org. It also has utilities to generate fictitious datasets for practice.
‘sklearn.neighbors‘ module provides unsupervised and supervised neighbors-based learning methods. Among these are classification and regression tools. You can have a quick look at all the classes to perform classification and other tasks here. In fact this is a one-page brief Reference for all the machine learning modules and classes under sklearn. I will strongly recommend that you visit this page if you are seriously learning python for machine learning using sklearn. Specifically, there are three classes that implement supervised classification algorithm based on nearest neighbor approach, namely, KNeighborsClassifier, RadiusNeighborsClassifier and NearestCentroid classifier. KNeighborsClassifier implements classification based on voting by nearest k-neighbors of target point, t, while RadiusNeighborsClassifier implements classification based on all neighborhood points within a fixed radius, r, of target point, t. In NearestCentroid classifier, each class is represented by the centroid of its members; thus the target point will be member of that class whose centroid is nearest to it. NearestCentroid algorithm is the simplest of the three and has no parameters to select from. Its results can be taken as the benchmark for evaluation purposes. Brief details about each of these three are given here.
I have described K-Nearest Neighbor algorithm in my earlier blog. You can read about it also in Wikipedia. In this blog I will use KNeighborsClassifier class for classification purposes. Parameters (and their default values) in this class are as follows:
KNeighborsClassifier(n_neighbors=5, weights=’uniform’, algorithm=’auto’, leaf_size=30, metric=’minkowski’, p=2, metric_params=None)
‘n_neighbors‘ are the number of neighbors that will vote for the class of the target point; default number is 5. An odd number is preferred to avoid any tie. ‘weights‘ parameter has two choices: ‘uniform‘ and ‘distance‘. For the ‘uniform‘ weight, each of the k neighbors has equal vote whatever its distance from the target point. If the weight is ‘distance‘ then voting weightage or importance varies by inverse of distance; those points who are nearest to the target point have greater influence than those who are farther away. Parameter ‘algorithm‘ is for selecting the indexing data structure that will be used for speeding up neighborhood search; value of ‘auto‘ leaves it to algorithm to make the best choice among the three. I have described the three algorithms, brute, kd_tree and ball_tree in my earlier blog. Parameter ‘leaf_size‘ is the size of leaf in kd_tree or ball_tree. Larger the size, greater the speed of initial indexing structure formation but at the cost of delay in classification of target point. Parameter ‘metric‘ decides how distances are calculated in space. One familiar way is euclidean distance but then in some cases other measures of distances such as Manhattan distance are also used. A general formulation of distance metric is ‘minkowski’ distance. When parameter ‘p‘ is 2, it is the same as euclidean distance and when parameter ‘p‘ is 1, it is Manhattan distance. You can read more about them here. Last parameter ‘metric_params‘ is to provide any additional arguments to metric function.
Thus, class KNeighborsClassifier can be used without explicitly specifying the value of any parameter. Default values are already supplied. This makes it easy to use it for initial learning purposes and then to add parameter values, one by one. This is precisely what I will do. We will use iris dataset for our demo program. Iris dataset is available in sklearn itself or you can download it from UCI machine learning repository from here. A simple program is as below:
import numpy as np from sklearn import neighbors, datasets # Load iris data from 'datasets module' iris = datasets.load_iris() # Get data-records and record-labels in arrays X and y X=iris.data y=iris.target</pre> # Create an instance of KNeighborsClassifier and then fit training data clf = neighbors.KNeighborsClassifier() clf.fit(X, y) # Make class predictions for all observations in X Z = clf.predict(X) # Compare predicted class labels with actual class labels accuracy=clf.score(X,y) print ("Predicted model accuracy: "+ str(accuracy)) # Add a row of predicted classes to y-array for ease of comparison A = np.vstack([y, Z]) print(A)
Instead, if you so like, you can download the iris dataset from UCI repository. Iris data, appears as follows; there are three classes of flowers: Iris-setosa, Iris-versicolor and Iris-virginica.
sepal_length,sepal_width,petal_length,petal_width,class 5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa 7.0,3.2,4.7,1.4,Iris-versicolor 6.4,3.2,4.5,1.5,Iris-versicolor 6.9,3.1,4.9,1.5,Iris-versicolor 5.5,2.3,4.0,1.3,Iris-versicolor 6.3,3.3,6.0,2.5,Iris-virginica 5.8,2.7,5.1,1.9,Iris-virginica 7.1,3.0,5.9,2.1,Iris-virginica
We will read data using pandas and analyse it. The code for this is as follows.
import numpy as np from sklearn import neighbors import pandas as pd # Read from downloaded iris data file df = pd.read_table('/home/ashokharnal/Documents/iris.data',sep=",") # Separate four data attributes and class data (the 5th attribute) # Slice data-frame column wise. When slicing the data frame using iloc, # the start bound (0) is included, while the upper bound (4) is excluded. X = df.iloc[:,0:4] # X includes columns 0,1,2,3 y = df['class'] # Get last column clf = neighbors.KNeighborsClassifier() clf.fit(X, y) Z = clf.predict(X) accuracy=clf.score(X,y) print ("Predicted model accuracy: "+ str(accuracy)) # Type of Z is numpy ndarray. Add, Z, to iris data frame as last column df['Z']=Z # Compare two classes: actual and predicted df.iloc[:,4:6]
Let us add a bit of complexity to above code. We will use only two of Iris-data attributes for predicting flower-classes. On X-Y plane we plot the two attributes; divide the area (positive quadrant) using a grid; evaluate the class of each point on the grid and color that cell with one of the three colors to display its class. Since there are three classes, there will be three class-centroids and certain adjacent points will have one color (class) and far away points will have different color (class). This way we can demarcate class boundaries. The following code does this.
import numpy as np import pylab as pl # for basic plotting jobs from matplotlib.colors import ListedColormap # for mapping colours to an array of values from sklearn import neighbors import pandas as pd df = pd.read_table('/home/ashok/Documents/iris.data',sep=",") # Recode three class values: "Iris-setosa" as 0, Iris-versicolor as 1, Iris-virginica as 2 df['class'].replace('Iris-virginica',2,inplace=True) df['class'].replace('Iris-versicolor',1,inplace=True) df['class'].replace('Iris-setosa',0,inplace=True) # Slice df vertically to include only cols 0 & 1 X = df.iloc[:, :2] # If X is numpy array then X = [:, :2] y = df['class'] # Get last column # Create list of three colors (corresponding to three class values) light_colors = ListedColormap(['blue', 'c', 'g']) # 'c' for cyan and 'g' for green bold_colors = ListedColormap(['r', 'k', 'yellow']) # 'r' for red and 'k' for black light_colors.colors # Just check colors in the list # Modelling clf = neighbors.KNeighborsClassifier() clf.fit(X, y) # Fix four corners of grid boundaries: Corners are # as per min and max values of each of the two Iris attributes x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1 y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1 # Create a mesh with bottom-left corner: (x_min,y_min) & # top-right corner: (x_max,y_max). # cell width & height: h. (Larger h leads to coarser class-boundaries) # 'meshgrid' is very useful to evaluate functions on a grid. h = 0.02 xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) # You could display the grid with the following code # pl.plot(xx,yy,'r.') # pl.show() # Use either of the following functions to flatten the multidimensional arrays # of xx & yy and make class predictions on those coordinates Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) Z1= clf.predict(np.c_[xx.flatten(), yy.flatten()]) Z2= clf.predict(np.c_[xx.reshape(len(xx)*len(xx),1), yy.reshape(len(yy)*len(yy),1)]) # Print Z values to understand the output print(Z) print(Z1) print(Z2) # Reshape Z as per xx array Z = Z.reshape(xx.shape) Z1=Z1.reshape(xx.shape) Z2=Z2.reshape(xx.shape) # Create figure in memory with default parameter values pl.figure() # Color every cell defined by arrays xx & yy as per values of Z # There are a total of xx*yy cells. Z is also of same size. # Each cell is colored in either of three colors as per corresponding value of Z (class). pl.pcolormesh(xx, yy, Z, cmap=light_colors) # Invoke only one of the functions pl.pcolormesh(xx, yy, Z1, cmap=light_colors) pl.pcolormesh(xx, yy, Z2, cmap=light_colors) # Plot Iris attribute coordinates on demarcated boundaries pl.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=bold_colors) pl.title("Iris-classification (weights = '%s')" % ('uniform')) pl.axis('tight') pl.show()
We will complicate the code further. We will now vary all parameters in class KNeighborsClassifier. Parameter values are as below:
n_neighbors: Values 3 or 15 weights: Values 'uniform' or 'distance' algorithm: Values 'auto', 'brute', 'ball_tree' or 'kd_tree' p: Values 1 or 2
We have altogether 2*2*4*2 = 32 choices. We use ‘for’ loop to cycle through these 32 combinations, one by one, evaluate accuracy and draw decision boundaries for each. The following code performs this.
import numpy as np import pylab as pl from matplotlib.colors import ListedColormap from sklearn import neighbors, datasets from sklearn.cross_validation import train_test_split # import some data to play with iris = datasets.load_iris() X = iris.data[:, :2] y = iris.target h = .01 # Create color maps from a list of colors light_colors = ListedColormap(['blue', 'c', 'g']) bold_colors = ListedColormap(['r', 'k', 'yellow']) # uniform and distance are two arguments for n_neighbors in [3,15]: for distancemetric in [1,2]: for algorithms in ['auto', 'ball_tree', 'kd_tree', 'brute']: for weights in ['uniform', 'distance']: if (distancemetric == 1): d_metric="Manhattan distance" else: d_metric="Euclidean distance" clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights,algorithm=algorithms,p=distancemetric ) clf.fit(X, y) x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) accuracy=clf.score(X,y) print("No of neighbors: "+str(n_neighbors)+", Distance metric: "+d_metric+", Algorithm is: " + algorithms + ", weights: "+ weights+ ", Accuracy is: "+ str(accuracy)) Z = Z.reshape(xx.shape) pl.figure() pl.pcolormesh(xx, yy, Z, cmap=light_colors ) # Plot also the data points pl.scatter(X[:, 0], X[:, 1], c=y, cmap=bold_colors) pl.title("3-Class classification (k = %i, weights = '%s', algorithms ='%s',distance_metric= '%s')" % (n_neighbors, weights,algorithms,d_metric)) pl.axis('tight') pl.show()
In this blog, I have picked up example code liberally from sklearn site here. However, as for a novice that example is a little difficult to understand, I have tried to explain it here in steps. Also, there is a bit of amplification. Hope, it is useful. Bye!
Tags: Classification using python, K-nearest neighbor classification example using python, K-NN in python example, KNeighborsClassifier explained, sklearn.neighbors, tutorial in sklearn and machine learning using python