CS 251: Assignment #5

Principal Components Analysis

Due Friday 22 March 2013

The goal of this week's lab is to add the capability to execute PCA on a data set and then create plots based on the analysis.


Read through all of the tasks and plan your design before you start writing code.

The first part of the project involves extending your Data and Analysis classes/files. The second part of the project involves integrating the analysis into the GUI.

Executing a PCA analysis creates three things. First, it generates the eigenvectors, which specify a new basis, or set of axes, for the data. Second, it generates the eigenvalues, which indicate how important each eigenvector is to representing the data. Third, projecting the data from its original data space into the PCA space generates a set of transformed data.
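The three outputs described above can be sketched with NumPy as follows. This is a minimal illustration, not the required implementation: the function name pca_sketch is hypothetical, it uses an eigendecomposition of the covariance matrix (an SVD-based approach is another common choice), and it assumes the input has at least two columns with rows as data points.

```python
import numpy as np

def pca_sketch(A):
    """Illustrative PCA. A is an N x M array: N data points, M >= 2 columns."""
    A = np.asarray(A, dtype=float)
    means = A.mean(axis=0)              # mean of each column
    D = A - means                       # center the data on the mean
    C = np.cov(D, rowvar=False)         # M x M covariance matrix
    evals, evecs = np.linalg.eigh(C)    # eigh: for symmetric matrices, ascending order
    order = np.argsort(evals)[::-1]     # reorder so the largest eigenvalue is first
    evals = evals[order]
    evecs = evecs[:, order]             # eigenvectors are the columns
    projected = np.dot(D, evecs)        # data expressed in the PCA basis
    return means, evals, evecs, projected
```

Each eigenvalue is the variance of the data along its eigenvector, so after projection the covariance matrix of the transformed data is diagonal with the eigenvalues on the diagonal.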

Because the transformed data is, for all intents and purposes, a new data set, it makes sense to use a Data object to hold it. However, that object also needs fields for the mean values, eigenvectors, and eigenvalues, as well as information such as the set of columns used for the analysis. Therefore, it makes sense to create a child class of Data, PCAData, that will hold the results of a PCA analysis.
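One possible shape for such a child class is sketched below. The Data class here is a hypothetical stand-in, since your own Data class from earlier assignments will have different fields and constructor arguments; the field names (source_headers, evals, evecs, means) and the "PCA00"-style column names are also assumptions. The accessor names are the ones the test file expects, as listed in task 1. The sketch stores 2-D NumPy arrays rather than numpy matrix objects.

```python
import numpy as np

class Data:
    """Hypothetical stand-in for your existing Data class."""
    def __init__(self, headers=None, matrix=None):
        self.headers = list(headers) if headers is not None else []
        if matrix is None:
            matrix = np.zeros((0, 0))
        self.matrix = np.asarray(matrix, dtype=float)

class PCAData(Data):
    """Holds the results of a PCA analysis (illustrative field names)."""
    def __init__(self, source_headers, projected, evals, evecs, means):
        projected = np.asarray(projected, dtype=float)
        # Name the new columns and let the parent populate its own fields.
        pca_headers = ["PCA%02d" % i for i in range(projected.shape[1])]
        Data.__init__(self, pca_headers, projected)
        self.source_headers = list(source_headers)                  # columns analyzed
        self.evals = np.asarray(evals, dtype=float).reshape(1, -1)  # single row
        self.evecs = np.asarray(evecs, dtype=float)                 # one per column
        self.means = np.asarray(means, dtype=float).reshape(1, -1)  # single row

    def getData(self):
        return self.matrix

    def getMeans(self):
        return self.means

    def getEigenvalues(self):
        return self.evals

    def getEigenvectors(self):
        return self.evecs
```

Calling the parent constructor first is one way to make sure every field of the Data class is populated before the PCA-specific fields are added.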

  1. Implement a new class PCAData that inherits from the Data class. You can put this new class into your Data.py file or create a new file for it. The class should have fields to hold the list of column IDs used to create the data, the eigenvalues (numpy matrix or list), the eigenvectors (numpy matrix), the mean data values (numpy matrix), and the projected data (numpy matrix).

    You will need to make sure that all of the fields of the Data class get populated when you create the PCAData class, including the field that holds the raw data.

    In order for my test file to work, your PCAData class has to support the following methods. You can make whatever other methods you feel will be useful.

    • getData() - returns a matrix with the data
    • getMeans() - returns a matrix with the means for each column
    • getEigenvalues() - returns a matrix with the eigenvalues in a single row
    • getEigenvectors() - returns a matrix with the eigenvectors as columns
  2. Implement a function pca in your Analysis class/file. The function should take in a list of column IDs and return a PCAData object with the set of source columns, eigenvalues, eigenvectors, and projected data within it.

    If you run this test file on this data file, then you should get this result.

  3. In your GUI, enable the user to pick a data file, execute a PCA analysis, store the result, and then create an entry in a listbox that links to the result. The basic capability should use all of the numeric columns of the selected data set. An extension is to allow the user to pick and choose columns.
  4. In your GUI, enable the user to select an analysis from the listbox and view the data projected onto the first three eigenvectors. An extension is to allow the user to pick the columns to plot.
  5. In your GUI, somehow enable the user to see the eigenvectors and eigenvalues of a selected PCA analysis. You can use plots, project the eigenvectors onto the original data, or just throw up a window with the numeric values in a table.
  6. Come up with an acronym or name for your program. Be creative. The success of your program may, in the end, be completely determined by how cool your acronym is. Then again, its success may have something to do with the quality of your work. But it never hurts to have a cool name.
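For task 5, if you choose the table option, one simple approach is to show each eigenvalue alongside the fraction of the total variance it accounts for. The sketch below is only a suggestion: the function names eigenvalue_table and format_table are hypothetical, and it produces plain text that you would still need to place in a window or widget of your own.

```python
def eigenvalue_table(evals):
    """Return rows of (index, eigenvalue, fraction, cumulative fraction)."""
    evals = [float(v) for v in evals]
    total = sum(evals)
    rows = []
    running = 0.0
    for i, v in enumerate(evals):
        frac = v / total if total > 0 else 0.0   # share of the total variance
        running += frac
        rows.append((i, v, frac, running))
    return rows

def format_table(rows):
    """Format the rows as fixed-width text, one eigenvalue per line."""
    lines = ["%-4s %12s %10s %10s" % ("PC", "eigenvalue", "fraction", "cumulative")]
    for i, v, frac, cum in rows:
        lines.append("%-4d %12.4f %10.4f %10.4f" % (i, v, frac, cum))
    return "\n".join(lines)
```

If the eigenvalues are stored as a single-row matrix, flatten them to a plain list before calling eigenvalue_table.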



Write a brief description of how you implemented the PCA algorithm and modified your Data and Application classes. Incorporate screen shots showing a visualization of the provided data set and another data set of your choice.


Once you have written up your assignment, give the page the label:


Put your code in your private handin directory on Courses. Please make sure you are organizing your code by project. If you have any problems uploading the code, send the prof a zip file.