Spring 2019

Machine Learning

Due Wednesday May 1, 2019

The goal of this week's project is to build two simple classifiers that can be trained from data. In particular, you will implement a Naive Bayes classifier and a K-nearest-neighbor (KNN) classifier. Once they are working, build some tools for evaluating their outputs and use your visualization app to look at the results.


Tasks

  1. Write the two methods in the Classifier parent class for creating and printing a confusion matrix. The confusion_matrix method should build a numpy matrix showing, for each true category, the number of data points classified as each output category. The confusion_matrix_str method should convert that matrix into a string that prints as a neatly aligned table.

    If you run the main test program from classifier-template.py on the iris data set you should get something that looks like these results. Note that the KNN results may be slightly different depending on the results of your K-means clustering process.

    These are the results on the UCI data set after PCA. Again, the KNN results may be slightly different depending on the K-means clustering process.
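As a rough illustration of the confusion matrix in task 1, here is a minimal sketch written as standalone functions rather than as methods of the Classifier class. The argument names (`true_cats`, `class_cats`, `num_classes`), the assumption that categories are already integer-coded as 0..C-1, and the choice of rows for the true category and columns for the predicted category are all assumptions of this sketch, not something specified by the template. It also returns a plain numpy array rather than a numpy matrix.

```python
import numpy as np

def confusion_matrix(true_cats, class_cats, num_classes=None):
    """Count how often each true category (row) was assigned to each
    predicted category (column). Assumes integer labels 0..C-1."""
    true_cats = np.asarray(true_cats, dtype=int).ravel()
    class_cats = np.asarray(class_cats, dtype=int).ravel()
    if num_classes is None:
        num_classes = int(max(true_cats.max(), class_cats.max())) + 1
    cmtx = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly when the same (row, col) pair repeats
    np.add.at(cmtx, (true_cats, class_cats), 1)
    return cmtx

def confusion_matrix_str(cmtx):
    """Format the confusion matrix as an aligned text table."""
    n = cmtx.shape[0]
    labels = [f"Class {i}" for i in range(n)]
    width = max(len(labels[-1]), len(str(int(cmtx.max())))) + 2
    lines = ["Confusion matrix (rows = actual, columns = predicted)"]
    lines.append(" " * width + "".join(f"{lab:>{width}}" for lab in labels))
    for i in range(n):
        cells = "".join(f"{int(v):>{width}}" for v in cmtx[i])
        lines.append(f"{labels[i]:>{width}}" + cells)
    return "\n".join(lines)

if __name__ == "__main__":
    actual = [0, 0, 1, 1, 2, 2]
    predicted = [0, 1, 1, 1, 2, 0]
    print(confusion_matrix_str(confusion_matrix(actual, predicted)))
```

The diagonal holds the correctly classified points, so `np.trace(cmtx) / cmtx.sum()` gives the overall accuracy.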

  2. Write a python function, probably in a new file, that does the following.
    1. Reads in a training set and its category labels, possibly as a separate file.
    2. Reads a test set and its category labels, possibly as a separate file.
    3. Builds a classifier using the training set.
    4. Classifies the training set and prints out a confusion matrix.
    5. Classifies the test set and prints out a confusion matrix.
    6. Writes out a new CSV data file with the test set data and the predicted categories as an extra column. Your visualization application should be able to read this file and plot it with the categories as colors.

    You will want to be able to use either the Naive Bayes or the KNN classifier for this task. You can create two files, or you can let the user select one or both classifiers from the command line.
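The last step of task 2, writing the test data with the predicted categories as an extra column, might look roughly like the sketch below. The function name `write_classified_csv`, its parameters, and the `predicted` column header are hypothetical. The sketch also assumes your visualization application reads a plain CSV with a single header row; if your data format expects additional header lines (for example, a row of column types), you would need to write those as well.

```python
import csv
import numpy as np

def write_classified_csv(filename, headers, data, predicted,
                         label_header="predicted"):
    """Write each data row followed by its predicted category.

    headers   : list of column names for the data (assumed format)
    data      : N x F array-like of numeric features
    predicted : length-N array-like of predicted category IDs
    """
    data = np.asarray(data)
    predicted = np.asarray(predicted).ravel()
    if data.shape[0] != predicted.shape[0]:
        raise ValueError("need exactly one predicted category per data row")
    with open(filename, "w", newline="") as fp:
        writer = csv.writer(fp)
        # header row: original column names plus the new category column
        writer.writerow(list(headers) + [label_header])
        for row, cat in zip(data.tolist(), predicted.tolist()):
            writer.writerow(row + [cat])
```

Keeping the predicted category as the final column makes it easy to select that column as the color axis when plotting.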

  3. Using Naive Bayes, run the above code on the original Activity Recognition data set. Then run it again on the PCA-transformed version of the data set. Repeat with the KNN classifier using a K of 3. Include the confusion matrices for all four cases in your report. Note any significant differences. Does one classifier do better than the other? Does the original or PCA-transformed data do better? Which classifier/data combination is the best performer? Is there a classifier/data combination that is more efficient, that is, one that runs faster or uses fewer features?
  4. Plot the activity recognition data set using the first three PCA axes and use color to show the output labels of the classifier. Include this image in your writeup.
  5. Select a data set of your choice, not the Iris or Activity Recognition sets, that is suitable for classification into categories. The data set should have the following properties.

    • It should have at least three numeric features you can use to predict the category.
    • It should have category labels, either as part of the data or as a separate file.
    • It should have at least 100 data points.

    Divide your data set into a training set and a testing set using a 70% training / 30% testing split. If the data is not already split, use a randomized procedure to make the split.
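One possible way to make the randomized 70/30 split is to shuffle the row indices and cut the shuffled order at the 70% mark. The sketch below assumes the data and labels are already loaded as numpy arrays with one row per data point; the function name `split_train_test` and the `seed` parameter are illustrative choices, not part of the assignment.

```python
import numpy as np

def split_train_test(data, cats, train_fraction=0.7, seed=None):
    """Randomly split rows into training and testing sets.

    Returns (train_data, train_cats, test_data, test_cats).
    """
    data = np.asarray(data)
    cats = np.asarray(cats)
    if data.shape[0] != cats.shape[0]:
        raise ValueError("data and cats must have the same number of rows")
    n = data.shape[0]
    rng = np.random.default_rng(seed)   # pass a seed for a repeatable split
    order = rng.permutation(n)          # shuffled row indices
    n_train = int(round(train_fraction * n))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return data[train_idx], cats[train_idx], data[test_idx], cats[test_idx]
```

Note that a plain shuffle does not guarantee that every category appears in both sets in the same proportion. For small or unbalanced data sets it is worth checking the category counts in each half after splitting.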

    Classify your data using both the Naive Bayes and KNN classifiers. Compare the results using a confusion matrix. Plot the results colored by output class using three reasonable axes.


Extensions


Writeup

Make a wiki page for the project report.

Handin

Once you have written up your assignment, give the page the label:

cs251s19project8

Put your code on the handin server in a project8 directory in your private subdirectory.