CS 251: Assignment #8


Due Thursday, 21 April, 2011


  1. Add to your DataSet class the ability to write out data as an ARFF file, suitable for use with the Weka Explorer program. Your function should probably take a list of the columns to write to the ARFF file. Your program does not have to support anything but numeric data, and this does not have to be integrated with your GUI.
  2. Download the following two data sets, which contain training and testing data for a handwritten digit recognition task in csv format, suitable for your programs. The meta-data is provided as the third link. Each data point consists of 64 integers in the range [0, 16], followed by a last column that indicates which digit it is (the label).

  3. Using your visualization and analysis program, execute PCA on the digit training set. Make sure your PCA analysis does not include the last column of data, which is the label. Do not normalize your data before executing the PCA, as all of the data has the same units and meaning.

    Project the data onto the eigenvectors, and then visualize the data using the first three eigenvectors. Color the plot by the labels.

    You may want to add to your GUI the capability to write out the results of the PCA analysis. Alternatively, you can do the next task as a standalone command-line program.

  4. Identify how many eigenvectors you need to use to represent 95% of the variation in the data set. Write out the projected data values as an ARFF file, along with the label column from the original data set. If you kept 6 eigenvectors, then your resulting file should have seven columns: the 6 projected values plus the label.

    Note, you do not need to integrate this task with your GUI. You can write a separate command line program that uses your DataSet class to handle the task.

  5. Using the Weka explorer, do the following with both the original digits training set and the projected training set.
    • Create a decision tree to classify the data into digits.
    • Pick one other classifier and train it to classify the data into digits.
    • Generate a confusion matrix for each classifier using the test set.
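
As a rough illustration of tasks 1, 3, and 4, the following Python sketch runs PCA without normalization, picks the number of eigenvectors needed to cover 95% of the variance, and writes the projected data plus the label column to an ARFF file. It assumes NumPy and plain arrays rather than your DataSet class. The function names `pca`, `num_components`, `write_arff`, and `export_projection` are hypothetical, and writing every column (including the label) as a numeric attribute is an assumption based on the numeric-only requirement above.

```python
import numpy as np

def pca(features):
    """PCA via the covariance matrix. Data is mean-centered but NOT scaled."""
    mean = features.mean(axis=0)
    centered = features - mean
    cov = np.cov(centered, rowvar=False)      # columns are variables
    evals, evecs = np.linalg.eigh(cov)        # eigh returns ascending order
    order = np.argsort(evals)[::-1]           # re-sort to descending
    return mean, evals[order], evecs[:, order]

def num_components(evals, fraction=0.95):
    """Smallest k whose first k eigenvalues hold at least `fraction` of the total."""
    cumulative = np.cumsum(evals) / np.sum(evals)
    k = int(np.searchsorted(cumulative, fraction)) + 1
    return min(k, len(evals))

def write_arff(filename, relation, headers, data):
    """Write a 2-D numeric array as ARFF, one numeric attribute per column."""
    with open(filename, "w") as f:
        f.write("@relation %s\n\n" % relation)
        for name in headers:
            f.write("@attribute %s numeric\n" % name)
        f.write("\n@data\n")
        for row in data:
            f.write(",".join(repr(float(v)) for v in row) + "\n")

def export_projection(features, labels, filename, fraction=0.95):
    """Project onto the leading eigenvectors and write projection + label."""
    mean, evals, evecs = pca(features)
    k = num_components(evals, fraction)
    projected = (features - mean) @ evecs[:, :k]
    headers = ["pc%d" % i for i in range(k)] + ["label"]
    write_arff(filename, "digits_pca", headers,
               np.column_stack([projected, labels]))
    return k
```

Here `features` would be the 64 pixel columns and `labels` the last column, so the label never enters the PCA. Weka's decision tree learners expect a nominal class attribute, so a numeric label column may need to be converted inside Weka (for example with its NumericToNominal filter) before training.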


  1. Test out other classifiers or variations of them in Weka Explorer.
  2. Try clustering the digits data using the projected version of the data. Compare the cluster labels with the actual digit labels and see whether they correlate.
  3. Try using more or fewer eigenvectors to train the classifiers.
  4. Do this type of analysis with a different data set. You can use the one you collected earlier in the semester or download a data set from the UCI ML Repository.
  5. Integrate all of the tasks into your GUI.
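
For the clustering extension, one simple way to compare cluster labels with digit labels is a cross-tabulation: count how many points of each digit fall into each cluster. The sketch below assumes you already have an integer cluster ID for every data point (from Weka or your own clustering code). The names `cross_tabulate` and `purity` are hypothetical, and purity is just one possible summary, not a required measure.

```python
import numpy as np

def cross_tabulate(cluster_ids, true_labels, n_clusters, n_classes):
    """counts[c, d] = number of points in cluster c whose true digit is d."""
    cluster_ids = np.asarray(cluster_ids, dtype=int)
    true_labels = np.asarray(true_labels, dtype=int)
    counts = np.zeros((n_clusters, n_classes), dtype=int)
    # add.at accumulates correctly even when index pairs repeat
    np.add.at(counts, (cluster_ids, true_labels), 1)
    return counts

def purity(counts):
    """Fraction of points that belong to the majority digit of their cluster."""
    return counts.max(axis=1).sum() / counts.sum()
```

If each row of the table is dominated by a single digit, the clusters line up well with the labels. Rows spread across many digits suggest the clusters capture some other structure.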


Give a brief description of the task and how you solved it. Include any decisions you had to make, including which classifiers you used, how many eigenvectors you used, etc.

Include one or more plots of the digit data in the PCA space.

Include the confusion matrices for the two classifiers on the test set data, both the original data and the projected data.

Briefly discuss the quality of the performance of each classifier on the task and how the PCA affected performance.
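
When discussing performance, it can help to reduce each confusion matrix to a few numbers. The sketch below is a minimal example, assuming the matrix is laid out as Weka prints it, with rows for the actual class and columns for the predicted class. The function name `summarize` is hypothetical.

```python
import numpy as np

def summarize(confusion):
    """Return overall accuracy and per-class recall for a confusion matrix.

    Assumes rows are actual classes and columns are predicted classes,
    and that every class appears at least once (no all-zero rows).
    """
    confusion = np.asarray(confusion, dtype=float)
    accuracy = np.trace(confusion) / confusion.sum()
    recall = np.diag(confusion) / confusion.sum(axis=1)
    return accuracy, recall
```

Comparing these values between the original data and the projected data, for the same classifier, gives a direct view of how the PCA affected performance.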

Describe any extensions you undertook.


Once you have written up your assignment, give the page the label:


Put your code in the COMP/CS251/yourname/private/ folder on fileserver1/Academics in a project8 folder.