Due Wednesday May 1, 2019
The goal of this week's project is to build two simple classifiers that can be trained from data. In particular, you will implement a Naive Bayes classifier and a K-nearest-neighbor (KNN) classifier. Once they are working, build some tools for evaluating the outputs and use your visualization app to look at the results.
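As a reference point, a Gaussian Naive Bayes classifier can be sketched as below. This is only one reasonable design, assuming each feature is normally distributed within each class; the `build`/`classify` method names are illustrative, not a required interface.

```python
import numpy as np

class NaiveBayes:
    """Sketch of a Gaussian Naive Bayes classifier: model each feature
    as an independent Gaussian within each class, then pick the class
    with the highest log-posterior."""

    def build(self, data, cats):
        data = np.asarray(data, dtype=float)
        cats = np.asarray(cats, dtype=int)
        self.classes = np.unique(cats)
        self.means, self.vars, self.priors = {}, {}, {}
        for c in self.classes:
            rows = data[cats == c]
            self.means[c] = rows.mean(axis=0)
            self.vars[c] = rows.var(axis=0) + 1e-9  # avoid divide-by-zero
            self.priors[c] = len(rows) / len(data)

    def classify(self, data):
        data = np.asarray(data, dtype=float)
        preds = []
        for row in data:
            best, best_ll = None, -np.inf
            for c in self.classes:
                # log prior plus sum of per-feature Gaussian log-likelihoods
                ll = np.log(self.priors[c]) - 0.5 * np.sum(
                    np.log(2 * np.pi * self.vars[c])
                    + (row - self.means[c]) ** 2 / self.vars[c])
                if ll > best_ll:
                    best, best_ll = c, ll
            preds.append(best)
        return np.array(preds)
```

For KNN, the analogous `classify` would instead compare each point's distances to the stored exemplars.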
Write the two functions in the Classifier parent class for creating and printing a confusion matrix. The confusion_matrix method should build a numpy matrix showing the number of data points in each true category that were classified as each output category. The confusion_matrix_str method should convert the matrix into a string that does a nice job of printing it out.
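A minimal sketch of the two methods, written here as standalone functions (the exact signatures inside your Classifier class may differ):

```python
import numpy as np

def confusion_matrix(true_cats, pred_cats, num_cats):
    """Entry (i, j) counts points whose true category is i
    and whose predicted category is j."""
    cm = np.zeros((num_cats, num_cats), dtype=int)
    for t, p in zip(true_cats, pred_cats):
        cm[int(t), int(p)] += 1
    return cm

def confusion_matrix_str(cm):
    """Format the matrix with labeled rows and columns."""
    n = cm.shape[0]
    header = "Truth\\Pred " + " ".join(f"{j:>6d}" for j in range(n))
    rows = [header]
    for i in range(n):
        rows.append(f"{i:>10d} " +
                    " ".join(f"{cm[i, j]:>6d}" for j in range(n)))
    return "\n".join(rows)
```

A perfect classifier produces a diagonal matrix; off-diagonal entries show which categories get confused with which.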
If you run the main test program from the classifier-template.py on the iris data set you should get something that looks like these results. Note that the KNN results may be slightly different depending on the results of your K-means clustering process.
These are the results on the UCI data set after PCA. Again, the KNN results may be slightly different depending on the K-means clustering process.
Write a python function, probably in a new file, that does the following:
- Reads in a training set and its category labels, possibly as a separate file.
- Reads a test set and its category labels, possibly as a separate file.
- Builds a classifier using the training set.
- Classifies the training set and prints out a confusion matrix.
- Classifies the test set and prints out a confusion matrix.
- Writes out a new CSV data file with the test set data and the predicted categories as an extra column. Your visualization application should be able to read this file and plot it with the categories as colors.
You will want to be able to use either the Naive Bayes or the KNN classifier for this task. You can create two files, or you can let the user select one or both classifiers from the command line.
- Using Naive Bayes, run the above code on the original Activity Recognition data set. Then run it again on the PCA-transformed version of the data set. Repeat with the KNN-classifier using a K of 3. Include the confusion matrices for all four cases in your report. Note any significant differences. Does one classifier do better than the other? Does the original or PCA-transformed data do better? Which classifier/data combination is the best performer? Is there a classifier/data combination that is more efficient/faster/uses fewer features?
- Plot the activity recognition data set using the first three PCA axes and use color to show the output labels of the classifier. Include this image in your writeup.
Select a data set of your choice, not the Iris or Activity
Recognition sets, that is suitable for classification into
categories. The data set should have the following properties.
- It should have at least three numeric features you can use to predict the category.
- It should have category labels, either as part of the data or as a separate file.
- It should have at least 100 data points.
Divide your data set into a training and a testing set using a 70% training / 30% testing split. Use a randomized procedure to make the split, if the data is not already split.
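One way to make the randomized split, assuming the data and labels are numpy-compatible arrays (the function name and signature are illustrative):

```python
import numpy as np

def train_test_split(data, cats, train_frac=0.7, rng=None):
    """Randomly permute the rows, then take the first train_frac
    of them as the training set and the rest as the test set."""
    data = np.asarray(data)
    cats = np.asarray(cats)
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(data))
    cut = int(round(train_frac * len(data)))
    return (data[idx[:cut]], cats[idx[:cut]],
            data[idx[cut:]], cats[idx[cut:]])
```

Passing a fixed seed for `rng` makes the split reproducible, which is handy when comparing classifiers on the same partition.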
Classify your data using both the Naive Bayes and KNN classifiers. Compare the results using a confusion matrix. Plot the results colored by output class using three reasonable axes.
Try variations on the training data or the classifiers and compare
performance on the Activity Recognition data set. For example:
- Use more or fewer PCA dimensions.
- Compare using clustering versus the entire data set for the KNN classifier.
- Compare using different numbers of exemplars per class for the KNN classifier.
- Compare using different numbers of neighbors in the distance sum for the KNN classifier.
- Compare using different distance metrics.
- Use a method other than K-means clustering to select a subset of exemplar points for KNN classification.
- Implement a different type of classifier.
- Explore more data sets.
- Integrate machine learning analysis into your GUI. Be very careful and intentional if you do this extension. Think for a while about your design before writing a single line of code to implement it.
Make a wiki page for the project report.
- Write a brief summary of your project, separate from the body of your report. It should describe the task, the key parts of your solution, and the result of your work (did it work, what can you do with your GUI?). The summary should be 200 words or less.
- Write a brief description of how you implemented the two classifiers and the results on the test data sets Iris and Activity Recognition.
- Incorporate screen shots showing a visualization of the test data sets using your GUI. Color the plots by output category.
- Report on the confusion matrices for the two standard data sets and answer the questions posed in task 3.
- Describe the data set you chose, which features you used as predictors, and what you were trying to predict/classify. Include a confusion matrix and an analysis of the results. Were your classifiers successful?
These are some guidelines to consider when writing this section of your report.
- Did this analysis help you learn anything from your data set?
- If you learned something, summarize your results and explain what you learned. If not, why not? For example, was the method not appropriate for your data, was the implementation incorrect, or were the results inconclusive?
- Are all of your scatter plot results properly labeled? Is it clear which features are plotted along each axis? It is OK if the information is in the text instead of on the image.
- Are all numeric results properly labeled? Is it clear what each number represents and what its unit is? Is the label itself clearly explained?
- Are numeric results presented in a concise, easy-to-read manner? For example, if there are more than 3 related numbers, a table or graph might be more appropriate than having the numbers directly in the text.
When discussing the results on your own data, consider
the following general questions.
- Is the method of visualization explained and justified? Which axes were chosen? Is it a scatter plot? Why do these features make sense for your analysis?
- Is the method of analysis clearly identified and explained? Which algorithm was applied, and how does it work?
- What do you expect to learn from this kind of analysis? What kinds of information does this algorithm extract?
- Why is this method of analysis a good fit for your dataset? Why do expect this algorithm to help you discover something about your data?
- Are there any characteristics of this method of analysis that had to be taken into consideration in order to visualize it in a useful way? Did you have to add features to your GUI to display the results?
- Be sure to document and describe any extensions.
- Summarize what you learned and identify any collaborators/assistance.
Once you have written up your assignment, give the page the label:
Put your code on the handin server in a project8 directory in your private subdirectory.