Due Monday 20 April 2015
The goal of this week's project is to build two simple classifiers that can be trained from data. In particular, you will implement a Naive Bayes classifier and a K-nearest-neighbor (KNN) classifier. Once they are working, build some tools for evaluating the outputs and use your visualization app to look at the results.
- Write the two functions in the Classifier parent class for creating and printing a confusion matrix. The confusion_matrix method should build a numpy matrix showing, for each true category, the number of data points classified as each output category. The confusion_matrix_str method should convert the matrix into a string that prints out cleanly, with labeled rows and columns.
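As a rough illustration of the counting involved, here is a minimal sketch written as standalone functions rather than as the Classifier methods. The argument names, the assumption that categories are integers 0..num_cats-1, and the choice of rows for true categories and columns for classifier output are all assumptions made for this example, not requirements of the assignment.

```python
import numpy as np

def confusion_matrix(true_cats, class_cats, num_cats):
    """Return a (num_cats x num_cats) count matrix.

    Assumption for this sketch: rows index the true category and
    columns index the category assigned by the classifier.
    """
    cmtx = np.zeros((num_cats, num_cats), dtype=int)
    for t, c in zip(true_cats, class_cats):
        cmtx[int(t), int(c)] += 1
    return cmtx

def confusion_matrix_str(cmtx):
    """Format the count matrix as a labeled, column-aligned string."""
    n = cmtx.shape[0]
    header = " " * 10 + "".join("{:>8}".format("Out %d" % j) for j in range(n))
    lines = [header]
    for i in range(n):
        cells = "".join("{:>8d}".format(int(cmtx[i, j])) for j in range(n))
        lines.append("{:<10}".format("True %d" % i) + cells)
    return "\n".join(lines)

if __name__ == "__main__":
    # Tiny made-up example: five points, three categories.
    cm = confusion_matrix([0, 0, 1, 1, 2], [0, 1, 1, 1, 0], 3)
    print(confusion_matrix_str(cm))
```

The diagonal holds the correctly classified points, so a quick accuracy figure is the trace divided by the total count.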
Write a Python function, probably in a new file, that does the following:
- Reads in a training set and its category labels, possibly as a separate file.
- Reads a test set and its category labels, possibly as a separate file.
- Builds a classifier using the training set.
- Classifies the training set and prints out a confusion matrix.
- Classifies the test set and prints out a confusion matrix.
- Writes out a new CSV data file with the test set data and the categories as an extra column. Your application should be able to read this file and plot it with the categories as colors.
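The last step above, writing the test data back out with the categories as an extra column, might look like the following sketch. It uses only the standard csv module and assumes a plain file with a single header row. The function name, the "category" column name, and the file layout are hypothetical, so adapt them to whatever format your own data reader expects.

```python
import csv

def write_classified_csv(filename, headers, rows, categories, cat_header="category"):
    """Write the data rows to a CSV file with one extra category column.

    headers    : list of column names for the original data
    rows       : iterable of data rows (each a sequence of values)
    categories : one classifier output per row, in the same order
    """
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(list(headers) + [cat_header])
        for row, cat in zip(rows, categories):
            writer.writerow(list(row) + [cat])
```

If your data is held in a numpy matrix, passing `matrix.tolist()` as `rows` is one way to get plain Python rows for the writer.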
You will want to be able to use either the Naive Bayes or the KNN classifier for this task. You can create two files, or you can let the user select one or both classifiers from the command line.
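If you go the command-line route, one possible arrangement uses the standard argparse module. The argument names and the "nb", "knn", and "both" choices below are invented for this sketch and are not part of the assignment.

```python
import argparse

def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Train a classifier and report confusion matrices.")
    parser.add_argument("train_file", help="CSV file with the training data")
    parser.add_argument("test_file", help="CSV file with the test data")
    parser.add_argument("--classifier", choices=["nb", "knn", "both"],
                        default="both",
                        help="which classifier(s) to run (default: both)")
    return parser.parse_args(argv)

def selected_classifiers(choice):
    """Expand the --classifier choice into a list of classifier names."""
    return ["nb", "knn"] if choice == "both" else [choice]

if __name__ == "__main__":
    args = parse_args()
    for name in selected_classifiers(args.classifier):
        # Placeholder: build, train, and evaluate the named classifier here.
        print("Would run", name, "on", args.train_file, "and", args.test_file)
```

Looping over the selected names keeps the reading, classifying, and reporting steps in one place, so both classifiers go through exactly the same evaluation code.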
- Run the above code on the original Activity Recognition data set. Then run it again on the PCA-transformed version of the data set. Include the confusion matrices in your writeup and note any significant differences.
- Plot the activity recognition data set using the first three PCA axes and use color to show the output labels of the classifier. Include this image in your writeup.
- Repeat the above two exercises on a data set of your choice other than the Iris and Activity Recognition data sets.
Try variations on the training data or the classifiers and compare performance on the Activity Recognition data set. For example:
- Use more or fewer PCA dimensions.
- Compare using clustering versus the entire data set for the KNN classifier.
- Compare using different numbers of exemplars per class for the KNN classifier.
- Compare using different numbers of neighbors in the distance sum for the KNN classifier.
- Use a method other than K-means clustering to select a subset of exemplar points for KNN classification.
- Implement a different type of classifier.
- Explore more data sets.
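For the neighbor-count comparison in the list above, it can help to make K an explicit parameter. The sketch below shows one plausible reading of the "distance sum" rule: for each class, sum the Euclidean distances from the query point to its K nearest exemplars in that class, then pick the class with the smallest sum. The function name, the data layout, and the use of Euclidean distance are assumptions for illustration, so your own KNN class may be organized differently.

```python
import numpy as np

def knn_classify(exemplars, point, k=3):
    """Classify one point using a per-class sum of nearest-neighbor distances.

    exemplars : list of (N_c x D) numpy arrays, one array per class
    point     : length-D numpy array
    k         : number of nearest exemplars summed per class

    Assumes every class has at least k exemplars; otherwise classes
    with fewer points would get an unfairly small sum.
    """
    scores = []
    for class_pts in exemplars:
        # Euclidean distance from the point to every exemplar of this class.
        dists = np.sqrt(np.sum((class_pts - point) ** 2, axis=1))
        # Sum of the k smallest distances.
        scores.append(np.sort(dists)[:k].sum())
    return int(np.argmin(scores))

if __name__ == "__main__":
    # Two small made-up classes in 2D.
    class0 = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
    class1 = np.array([[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
    for k in (1, 2, 3):
        print(k, knn_classify([class0, class1], np.array([0.2, 0.1]), k=k))
```

Running the same train/test split over several values of K and comparing the resulting confusion matrices gives a direct view of how sensitive the classifier is to this parameter.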
For this week's writeup, create a wiki page that shows your classifiers working, the confusion matrices, the plots of classified output, and explains any extensions.
Once you have written up your assignment, give the page the label:
Put your code on the handin server in your private subdirectory.