Spring 2018

Principal Components Analysis

Due Monday 9 April 2018

The goal of this week's lab is to give your GUI the capability to execute PCA on a data set and then create plots based on the analysis.


Tasks

  1. In your GUI, enable the user to pick a data file, execute a PCA analysis, store the result, and then create an entry in a listbox that links to the result. The basic capability should allow the user to choose which columns of the original data to use in the PCA analysis. Each new analysis should show up as a new entry in the analysis listbox. Allow the user to delete an existing analysis.
  2. In your GUI, enable the user to select an analysis from the listbox and view the data projected onto the first three eigenvectors. An extension is to allow the user to pick the columns to plot. In general, you do not want to normalize the columns of the transformed PCA data separately before plotting, but doing so is acceptable.
  3. In your GUI, somehow enable the user to see the eigenvectors and eigenvalues of a selected PCA analysis. For example, show them in a dialog window as a table, as shown below.

    Eigenvector Table

    You probably want to use the tkinter grid layout method in order to build the table.

    Note the second and third columns, which show the eigenvalues and the cumulative percentage of the eigenvalues from largest to smallest. In this case, the first five eigenvectors explain 92% of the variation in the data set.

  4. Using the Australia Coast data set, compute the PCA analysis on the columns: premin, premax, salmin, salmax, minairtemp, maxairtemp, minsst, maxsst, minsoilmoist, maxsoilmoist, and runoffnew. Then show a spatial plot of the data projected onto the first three eigenvectors. The plot should look something like the following.

    PCA plot

  5. Using a data set of your choice, execute a PCA analysis using at least three dimensions of data. If you wish, you can use the Iris data set, which is simple and has a nice PCA result. Do not use the iris "class" feature in the PCA analysis. In your report, discuss/include the following items.

    • Is your data homogeneous or heterogeneous? (If heterogeneous, be sure to normalize the data prior to executing the PCA)
    • How many significant dimensions exist in the features you chose for the PCA analysis? Put another way, how many eigenvectors are required to represent at least 90% of the data variation?
    • Look at the first eigenvector. What dimensions are the primary contributors to it (have the largest coefficients)? Are those dimensions negatively or positively correlated? Does the result make sense?
    • Show a plot of your transformed data using the first three eigenvectors.

  6. Come up with an acronym or name for your program. Be creative. The success of your program may, in the end, be completely determined by how cool your acronym is. Then again, its success may have something to do with the quality of your work. But it never hurts to have a cool name.
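
As a rough illustration of the computation behind tasks 1, 3, and 5, the following is a minimal sketch of PCA using NumPy's singular value decomposition. It is not the required design for the lab. The function names (`pca`, `cumulative_fraction`, `significant_dimensions`), the `normalize` flag, and the choice to normalize by each column's standard deviation are all assumptions made for this example. Your own data class and normalization scheme may differ.

```python
import numpy as np


def pca(data, normalize=True):
    """Return (means, eigenvalues, eigenvectors, projected) for a 2-D array.

    data: N rows (points) by M columns (the user-selected features).
    normalize: if True, divide each centered column by its standard
        deviation (an assumed choice for heterogeneous data).
    """
    A = np.asarray(data, dtype=float)
    means = A.mean(axis=0)
    D = A - means                      # center every column on its mean
    if normalize:
        D = D / A.std(axis=0, ddof=1)  # assumes no constant columns
    # SVD of the centered data: the rows of Vt are the eigenvectors of the
    # covariance matrix, and S**2 / (N - 1) are the matching eigenvalues,
    # already sorted from largest to smallest.
    U, S, Vt = np.linalg.svd(D, full_matrices=False)
    eigenvalues = S ** 2 / (A.shape[0] - 1)
    projected = D @ Vt.T               # data expressed in the eigenvector basis
    return means, eigenvalues, Vt, projected


def cumulative_fraction(eigenvalues):
    """Running fraction of total variation, largest eigenvalue first."""
    return np.cumsum(eigenvalues) / np.sum(eigenvalues)


def significant_dimensions(eigenvalues, threshold=0.9):
    """Smallest number of eigenvectors whose cumulative fraction of the
    eigenvalues reaches the threshold."""
    cum = cumulative_fraction(eigenvalues)
    return int(np.searchsorted(cum, threshold) + 1)
```

Under these assumptions, the first three columns of `projected` are the values to plot for task 2, and `cumulative_fraction(eigenvalues)` gives the third column of the eigenvector table described in task 3.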

Extensions


Report

Make a wiki page for the project report.

Handin

Once you have written up your assignment, give the page the label:

cs251s18project6

Put your code in your private handin directory on Courses. Please make sure you are organizing your code by project in the Private subdirectory.