 Spring 2019

### Machine Learning

The goal of this week's project is to build two simple classifiers that can be trained from data. In particular, you will implement a Naive Bayes classifier and a K-nearest-neighbor (KNN) classifier.

### Tasks

When creating a classifier, it is worth considering using a class, since a classifier has both data and functional components. You can use the following Classifier template file that contains a parent Classifier class and two child classes, one each for the Naive Bayes and the KNN classifiers.

Download the following data sets for this project. Note that the labels for the training and test sets of the Activity Recognition data set work for both the original data and the eigenvector-projected data.

1. Write the code for building a Naive Bayes classifier: the __init__ and build methods of the NaiveBayes class. Be careful with the categories list: the labels are not guaranteed to be integers, or even to lie in the range [0, C-1], where C is the number of categories. Look up the function numpy.unique. If you have an Nx1 matrix of categories, you can get the list of unique labels and an N-element array that maps each data point to a category index in the range [0, C-1] using the following.

    unique, mapping = np.unique( np.array(categories.T), return_inverse=True)

When computing the means, variances, and scales, it's also worth looking up the meaning of an expression like A[(mapping==i),:], where mapping is an array of class labels (like the one returned by np.unique). This is a good project to extend your numpy skills (your numpy-fu).
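As a minimal sketch of the indexing idea above, the following uses a made-up toy data set (the array A, the labels 3, 7, and 10, and all variable names are illustrative assumptions, not part of the template). It also assumes that the "scales" are the Gaussian normalization terms and that the sample variance (ddof=1) is wanted; check both against the test file output.

```python
import numpy as np

# Hypothetical toy data: 6 points, 2 features, labels that are not 0..C-1.
A = np.array([[1.0, 2.0],
              [1.2, 1.8],
              [5.0, 6.0],
              [5.2, 6.4],
              [9.0, 0.5],
              [9.4, 0.7]])
categories = np.array([[3], [3], [7], [7], [10], [10]])  # N x 1 column

# ravel() flattens the column so that mapping is a plain length-N array.
unique, mapping = np.unique(np.asarray(categories).ravel(), return_inverse=True)
C = unique.shape[0]

means = np.zeros((C, A.shape[1]))
variances = np.zeros((C, A.shape[1]))
for i in range(C):
    rows = A[(mapping == i), :]                 # boolean mask: only rows of class i
    means[i, :] = rows.mean(axis=0)
    variances[i, :] = rows.var(axis=0, ddof=1)  # sample variance (an assumption)

# Assumed meaning of "scales": the 1/sqrt(2*pi*variance) Gaussian factor.
scales = 1.0 / np.sqrt(2.0 * np.pi * variances)
```

Here unique holds the original labels and mapping holds the matching indices, so row i of means, variances, and scales describes the class whose original label is unique[i].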

In addition, you will need to compute and store the priors for each category. The probability of a category should be the number of examples of the category divided by the total number of examples. In a more general system, the prior probabilities might be provided separately because the number of samples of each category in our training data might not be representative of the prior probability of the category.
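One possible way to compute the priors from the mapping array is sketched below. The mapping values are made up for illustration, and np.bincount is just one option; a loop over the categories with a boolean mask works equally well.

```python
import numpy as np

# Hypothetical mapping, as returned by np.unique(..., return_inverse=True).
mapping = np.array([0, 0, 0, 1, 2, 2])

counts = np.bincount(mapping)        # number of examples in each category
priors = counts / mapping.shape[0]   # fraction of all examples per category
```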

When you have the build function ready, test it with the Iris data set using naivebayes_test1.py. You should get something like this result for the training set and this result for the whole data set.

2. Write the code for classifying data using the Naive Bayes classifier: the classify method of the NaiveBayes class.

Again, this is a good opportunity to improve your numpy skills. Think about the process as having three steps. First, create a matrix P that is N (number of points) by C (number of categories). Entry (i, j) of that matrix holds the likelihood that data point i was generated by category j.

Think about building up the matrix P column by column, or one category at a time. You should be able to calculate the likelihood of every data point under one class in one (long) line of code. Functions you will want to use include the numpy functions square, exp, multiply, and prod.

The second step is to multiply the likelihood of each category by its prior probability.

The final step is to find the index of the maximum value in each row, which is the best label for that data point.
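The three steps above can be sketched as follows. This assumes a Gaussian model with independent features, so that the likelihood is a product of per-feature Gaussians. The function name gaussian_nb_scores and its arguments are hypothetical and not part of the template; in the template these quantities would presumably be fields stored by build.

```python
import numpy as np

def gaussian_nb_scores(A, means, variances, priors):
    """Return (cats, P) for an N x F data array A.

    means and variances are C x F arrays, priors has length C.
    Assumes independent Gaussian features (a sketch, not the template's API).
    """
    N = A.shape[0]
    C = means.shape[0]
    P = np.zeros((N, C))

    # Step 1: fill P one column (one category) at a time.
    for i in range(C):
        scale = 1.0 / np.sqrt(2.0 * np.pi * variances[i, :])
        P[:, i] = np.prod(
            np.multiply(scale, np.exp(-np.square(A - means[i, :]) / (2.0 * variances[i, :]))),
            axis=1)

    # Step 2: weight each column by the prior of its category.
    P = P * priors

    # Step 3: the index of the largest value in each row is the category.
    cats = np.argmax(P, axis=1)
    return cats, P

# Toy usage with made-up parameters for two well-separated classes.
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.ones((2, 2))
priors = np.array([0.5, 0.5])
A = np.array([[0.1, -0.2], [4.8, 5.3]])
cats, P = gaussian_nb_scores(A, means, variances, priors)
```

With many features the product of small likelihoods can underflow; summing log-likelihoods is a common alternative, though the test files here may expect the plain product.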

The output of the classify function should be the following.

• (Required output) A column matrix of categories (Nx1), with values between 0 and C-1, where C is the number of classes. For the Activity Recognition data set, these should be values between 0 and 5. There should be one category value for each data point.
• (Required output) A column matrix of labels (Nx1), using the original labels used to provide the classes. For example, the original labels for the Activity Recognition data set are numbers from 1 to 6 (not 0 to 5).
• (Optional output) An N by C matrix giving the value calculated for each data point and each class. For the Activity Recognition data set, this will be N by 6 and will show the probabilities of each of the activity categories for each data point.
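As a rough illustration of how the two required outputs relate, the sketch below turns a made-up score matrix P into category indices and then into the original labels. The values of unique and P are invented, and plain 2-D arrays stand in for column matrices; convert them if the template expects numpy matrices.

```python
import numpy as np

# Hypothetical stored labels (from np.unique at build time) and scores.
unique = np.array([1, 2, 3])
P = np.array([[0.1, 0.7, 0.2],
              [0.6, 0.3, 0.1]])

cats = np.argmax(P, axis=1)    # indices in the range 0..C-1
labels = unique[cats]          # look up the original label for each index

cats_col = cats.reshape(-1, 1)      # N x 1 column of categories
labels_col = labels.reshape(-1, 1)  # N x 1 column of original labels
```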

When you have the classify function ready, test it with the Iris training and test data sets using naivebayes_test2.py. You should get this result.

3. Write the code for building a KNN classifier: the __init__ and build methods of the KNN class. Store the example data points for a class in a matrix, with each point as a row. Store the set of matrices, one for each class, in a list.

The default classifier should take in the training data and store it as a set of C (number of categories) matrices, where each matrix is the set of points in category i.

If the build method is given a value for the parameter K, however, then it should execute K-means clustering on each category. In other words, for each category, execute K-means for that category, creating K exemplar data points (the codebook output of the K-means clustering). Store only the codebook returned by K-means for use by the KNN classifier.
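The two build modes might look roughly like the sketch below. The function simple_kmeans is a bare-bones stand-in (Lloyd's algorithm with a fixed iteration count) for whatever K-means routine you already use in this course, and build_exemplars is a hypothetical helper, not a method of the template.

```python
import numpy as np

def simple_kmeans(X, K, iters=20, seed=0):
    """Stand-in K-means; returns a K x F codebook. Requires K <= number of rows."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from K distinct data points chosen at random.
    codebook = X[rng.choice(X.shape[0], size=K, replace=False), :].copy()
    for _ in range(iters):
        # Squared distance from every point to every mean: N x K.
        d2 = np.square(X[:, None, :] - codebook[None, :, :]).sum(axis=2)
        nearest = d2.argmin(axis=1)
        for k in range(K):
            members = X[nearest == k, :]
            if members.shape[0] > 0:   # keep the old mean if a cluster is empty
                codebook[k, :] = members.mean(axis=0)
    return codebook

def build_exemplars(A, categories, K=None):
    """Return (unique labels, list of C exemplar arrays), one array per class."""
    A = np.asarray(A, dtype=float)
    unique, mapping = np.unique(np.asarray(categories).ravel(), return_inverse=True)
    exemplars = []
    for i in range(unique.shape[0]):
        points = A[(mapping == i), :]
        # Default: keep every training point; with K: keep only the codebook.
        exemplars.append(points if K is None else simple_kmeans(points, K))
    return unique, exemplars

# Toy usage: 3 classes with 4 made-up points each.
A = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [6, 6],
              [9, 0], [9, 1], [10, 0], [10, 1]], dtype=float)
categories = np.array([2, 2, 2, 2, 4, 4, 4, 4, 8, 8, 8, 8]).reshape(-1, 1)
unique_all, ex_all = build_exemplars(A, categories)
unique_k, ex_k = build_exemplars(A, categories, K=2)
```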

When you have finished the build method, test it with the complete Iris data set using knn_test1.py. You should get this result. Note that the second classifier may be slightly different, since K-means has a randomized starting point.

4. Write the code for classifying data using a KNN classifier: the classify method of the KNN class. This involves a bit more computation than the Naive Bayes classifier.

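For each class you have a set of exemplar points. You need to calculate the Euclidean distance between each data point and each exemplar, which makes a large N x M matrix, where N is the number of data points and M is the number of exemplars in the class. Then sort each row of that matrix (the distances from one data point to all of the exemplars) in ascending order and calculate the sum of the first K distances. This forms one column of the overall NxC distance matrix, which tells you the distance between each data point and each class. The best category for a data point is the one with the smallest distance.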
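A compact way to express this computation with broadcasting is sketched below. The function name knn_distances and the default K=3 are assumptions for illustration. The N x M x F intermediate array can get large for a data set the size of Activity Recognition, so a loop over blocks of points may be needed in practice.

```python
import numpy as np

def knn_distances(A, exemplars, K=3):
    """Return (cats, D): D[n, c] is the sum of the K smallest distances
    from data point n to the exemplars of class c."""
    N = A.shape[0]
    D = np.zeros((N, len(exemplars)))
    for i, E in enumerate(exemplars):
        diff = A[:, None, :] - E[None, :, :]           # N x M x F differences
        dist = np.sqrt(np.square(diff).sum(axis=2))    # N x M Euclidean distances
        dist.sort(axis=1)                              # ascending within each row
        D[:, i] = dist[:, :K].sum(axis=1)              # uses all M if M < K
    cats = np.argmin(D, axis=1)                        # smallest distance wins
    return cats, D

# Toy usage with two made-up classes of three exemplars each.
exemplars = [np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
             np.array([[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])]
A = np.array([[0.0, 0.0], [5.0, 5.0]])
cats, D = knn_distances(A, exemplars, K=2)
```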

The output of the classify function should be the following.

• (Required output) A column matrix of categories (Nx1), with values between 0 and C-1, where C is the number of classes. For the Activity Recognition data set, these should be values between 0 and 5. There should be one category value for each data point.
• (Required output) A column matrix of labels (Nx1), using the original labels used to provide the classes. For example, the original labels for the Activity Recognition data set are numbers from 1 to 6 (not 0 to 5).
• (Optional output) An N by C matrix giving the value calculated for each data point and each class. For the Activity Recognition data set, this will be N by 6 and will show the relative distances to each of the activity categories for each data point.

When you have the classify function ready, test it using the Iris training and test sets with knn_test2.py. You should get this result. Again, there may be some differences in the results with the second classifier.

When you are done with the lab, go ahead and continue with the project.