CS 251: Assignment #8

Machine Learning

Due Monday 20 April 2015

The goal of this week's project is to build two simple classifiers that can be trained from data. In particular, you will implement a Naive Bayes classifier and a K-nearest-neighbors (KNN) classifier.


Tasks

When creating a classifier, it is worth considering using a class, since a classifier has both data and functional components. You can use the following Classifier template file, which contains a parent Classifier class and two child classes, one each for the Naive Bayes and KNN classifiers.

You will want to download the following data sets for this project. Note that the labels for the training and test sets of the Activity Recognition data set work for both the original data and the eigenvector-projected data.

  1. Write the code for building a Naive Bayes classifier: the __init__ and build methods of the NaiveBayes class. Be careful with the categories list, as it is not guaranteed that the labels will be integers or even in the range [0..C-1], where C is the number of categories. Look up the function numpy.unique. If you have an Nx1 matrix of categories, you can get the list of unique labels and an N-element array of category indices in the range [0, C-1] using the following.

    unique, mapping = np.unique(np.array(categories.T), return_inverse=True)

    When computing the means, variances, and scales, it is also worth looking up the meaning of an expression like A[(mapping==i),:], where mapping is an array of class indices (like the one returned by np.unique above). This is a good project to extend your numpy skills (your numpy-fu).
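
    As a concrete illustration, here is a minimal sketch of how the build method of the NaiveBayes class might compute the per-class statistics. The attribute names (self.class_labels, self.means, self.vars, self.scales) are assumptions for the sketch; use whatever names the template file expects.

    import numpy as np

    def build(self, A, categories):
        # Map the raw labels to class indices 0..C-1, whatever the originals were.
        unique, mapping = np.unique(np.array(categories.T), return_inverse=True)
        self.class_labels = unique                 # assumed attribute name
        C = len(unique)
        A = np.asarray(A, dtype=float)

        # One row of statistics per class (assumed attribute names).
        self.means = np.zeros((C, A.shape[1]))
        self.vars = np.zeros((C, A.shape[1]))
        self.scales = np.zeros((C, A.shape[1]))
        for i in range(C):
            rows = A[(mapping == i), :]            # all points in class i
            self.means[i, :] = rows.mean(axis=0)
            self.vars[i, :] = rows.var(axis=0)
            self.scales[i, :] = 1.0 / np.sqrt(2.0 * np.pi * self.vars[i, :])
        return self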

    When you have the build function ready, test it with the complete Iris data set using naivebayes_test1.py. You should get this result.

  2. Write the code for classifying data using the Naive Bayes classifier: the classify method of the NaiveBayes class.

    Again, this is a good opportunity to improve your numpy-fu. Think about the process as having two steps. First, create a matrix P that is N (number of points) by C (number of categories), where each entry holds the probability that the corresponding data point belongs to that class. Second, find the index of the maximum value in each row, which gives the best category for that data point.

    Think about building up the matrix P column by column, or one category at a time. You should be able to calculate the probability that a data point belongs to a class in one (long) line of code. Functions you will want to use include the numpy functions square, exp, multiply, and prod.
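
    For example, under the build sketch above, each column of P can be filled with one expression like the following. This assumes the self.means, self.vars, and self.scales attribute names from that sketch, that A is an N x D numpy array, and that P starts as np.zeros((N, C)) with C = len(self.class_labels).

    for i in range(C):
        # Per-feature Gaussian likelihoods, multiplied across the D features.
        P[:, i] = np.prod(
            np.multiply(self.scales[i, :],
                        np.exp(-np.square(A - self.means[i, :])
                               / (2.0 * self.vars[i, :]))),
            axis=1)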

    The output of the classify function should be the following.

    • (Required output) An Nx1 column matrix of categories, with values between 0 and C-1, where C is the number of classes. For the Activity Recognition data set, these should be values between 0 and 5. There should be one category value for each data point.
    • (Required output) An Nx1 column matrix of labels, using the original labels that were provided with the classes. For example, the original labels for the Activity Recognition data set are numbers from 1 to 6 (not 0 to 5).
    • (Optional output) An N by C matrix giving the value calculated for each data point and each class. For the Activity Recognition data set, this will be N by 6 and will show the probabilities of each of the activity categories for each data point. A sketch of producing these outputs follows this list.
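
    A minimal sketch of turning P into those outputs, again assuming the unique label array from build was stored as self.class_labels:

    categories = np.argmax(P, axis=1)           # best class index for each point
    labels = self.class_labels[categories]      # map indices back to the original labels
    return np.matrix(categories).T, np.matrix(labels).T, P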

    When you have the classify function ready, test it with the Iris training and test data sets using naivebayes_test2.py. You should get this result.


  3. Write the code for building a KNN classifier: the __init__ and build methods of the KNN class. Store the example data points for a class in a matrix, with each point as a row. Store the set of matrices, one for each class, in a list.

    The default classifier should take in the training data and store it as a set of C (number of categories) matrices, where each matrix holds the points belonging to one category.

    If the build method is given a value for the parameter K, however, then it should execute K-means clustering on each category, creating K exemplar data points per category (the codebook output of the K-means clustering). Store only the codebook returned by K-means for use by the KNN classifier.
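
    Here is a minimal sketch of the KNN build method covering both cases. It uses scipy.cluster.vq.kmeans as a stand-in for whatever K-means routine you wrote earlier in the course, and the self.class_labels and self.exemplars attribute names are assumptions for the sketch.

    import numpy as np
    import scipy.cluster.vq as vq

    def build(self, A, categories, K=None):
        unique, mapping = np.unique(np.array(categories.T), return_inverse=True)
        self.class_labels = unique
        self.exemplars = []                       # one matrix of exemplars per class
        A = np.asarray(A, dtype=float)

        for i in range(len(unique)):
            pts = A[(mapping == i), :]            # all training points in class i
            if K is None:
                # Default: keep every training point as an exemplar.
                self.exemplars.append(pts)
            else:
                # Condense the class to K exemplars; keep only the codebook.
                codebook, _ = vq.kmeans(pts, K)
                self.exemplars.append(codebook)
        return self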

    When you have finished the build method, test it with the complete Iris data set using knn_test1.py. You should get this result. Note that the second classifier may be slightly different, since K-means has a randomized starting point.

  4. Write the code for classifying data using a KNN classifier: the classify method of the KNN class. This involves a bit more computation than the Naive Bayes classifier.

    For each class you have a set of exemplar points. You need to calculate the Euclidean distance between each exemplar and each data point, which produces a large N x M matrix, where N is the number of data points and M is the number of exemplars for that class. Then sort each row of distances in ascending order (along the column axis) and sum the first K distances, which are the K smallest. This sum forms one column of the overall N x C distance matrix that tells you the distance between each data point and each class.
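
    As a sketch, here is one way to organize that computation in the classify method. It uses scipy.spatial.distance.cdist for the pairwise Euclidean distances and the attribute names assumed in the build sketch above; the default value K=3 is also an assumption.

    import numpy as np
    from scipy.spatial.distance import cdist

    def classify(self, A, K=3):
        A = np.asarray(A, dtype=float)
        D = np.zeros((A.shape[0], len(self.exemplars)))

        for i, exemplars in enumerate(self.exemplars):
            # N x M matrix of distances from every point to this class's exemplars.
            dists = cdist(A, exemplars, 'euclidean')
            dists.sort(axis=1)                    # ascending along each row
            D[:, i] = dists[:, :K].sum(axis=1)    # sum of the K smallest distances

        categories = np.argmin(D, axis=1)         # closest class wins
        labels = self.class_labels[categories]    # back to the original labels
        return np.matrix(categories).T, np.matrix(labels).T, D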

    The output of the classify function should be the following.

    • (Required output) An Nx1 column matrix of categories, with values between 0 and C-1, where C is the number of classes. For the Activity Recognition data set, these should be values between 0 and 5. There should be one category value for each data point.
    • (Required output) An Nx1 column matrix of labels, using the original labels that were provided with the classes. For example, the original labels for the Activity Recognition data set are numbers from 1 to 6 (not 0 to 5).
    • (Optional output) An N by C matrix giving the value calculated for each data point and each class. For the Activity Recognition data set, this will be N by 6 and will show the relative distances to each of the activity categories for each data point.

    When you have the classify function ready, test it using the Iris training and test sets with knn_test2.py. You should get this result. Again, there may be some differences in the results with the second classifier.

When you are done with the lab, go ahead and continue with the project.