CS 251: Lab #5

Lab 5: Principal Component Analysis

Project due Friday night Mar 20, 2014

The goal of this week's lab is to add the capability to execute a principal component analysis (PCA) on a data set.


Tasks

Executing a PCA creates three things. First, it generates the eigenvectors, which specify a new basis, or set of axes, for the data. Second, it generates the eigenvalues, which indicate how important each eigenvector is to representing the data. Third, projecting the data from its original data space into the PCA space generates a set of transformed data.
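
As a rough illustration of those three outputs, here is a minimal NumPy sketch. It assumes the covariance-eigendecomposition route to PCA, which is only one of several ways to compute it. The function name pca_sketch and its return order are hypothetical and are not part of the lab's required interface.

```python
import numpy as np

def pca_sketch(A):
    """Hypothetical helper: PCA of an (N rows x d columns) numeric array."""
    A = np.asarray(A, dtype=float)
    mean = A.mean(axis=0)                  # mean of each column
    D = A - mean                           # center the data
    cov = D.T @ D / (A.shape[0] - 1)       # d x d sample covariance
    evals, evecs = np.linalg.eigh(cov)     # eigh: symmetric input, ascending order
    order = np.argsort(evals)[::-1]        # reorder so the largest eigenvalue is first
    evals = evals[order]
    evecs = evecs[:, order]                # each eigenvector is a column
    projected = D @ evecs                  # data expressed in the PCA basis
    return mean, evals, evecs, projected

mean, evals, evecs, projected = pca_sketch([[1, 2], [3, 4], [5, 6]])
```

In this sketch each eigenvalue equals the sample variance of the matching projected column, which is one way to read "how important" an eigenvector is. The sign of each eigenvector is arbitrary, so projected columns may come out negated.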

Because the transformed data is, for all intents and purposes, a new data set, it makes sense to use a Data object to hold it. However, we need to extend that Data object to include fields for the mean values, eigenvectors and eigenvalues, as well as information such as the set of columns used for the analysis. Therefore, it makes sense to extend the Data class and create a child class, PCAData, that will hold the results of a PCA.

  1. Implement a new class PCAData that is derived from the Data class. You can put this new class into your data.py file or create a new file for it. The class should use the same fields as the parent class to store the original and projected data (treated as one set of numeric columns). It should add fields to store the eigenvalues (numpy matrix or list), the eigenvectors (numpy matrix), the mean data values (numpy matrix), and the projected data (numpy matrix).

    You will need to make sure that all of the fields of the Data class get populated when you create the PCAData class, including the field that holds the raw data.

    For the test files to work, your PCAData class must support the following methods. You can add whatever other methods you feel will be useful.

    • get_eigenvalues() - Return a copy of the eigenvalues in a numpy matrix. The return value is a single-row matrix.
    • get_eigenvectors() - Return a copy of the eigenvectors in a numpy matrix. Each eigenvector is a column.
    • get_data_mean() - Return a single-row matrix containing the mean of each numeric column in the original data.
    • get_pca_headers() - Return a list of strings naming the headers of the columns that contain the data projected onto the principal components.
    • get_data_headers() - Return a list of strings naming the headers of the columns that contain the original data.

    If you would like to copy/paste/modify Stephanie's code, it is here. If you have used the suggested field names and meanings given in the instructions for project 2, this will be straightforward. If not, it may be easier to write your own code from scratch.

    Test your PCAData class with this file.

  2. Implement a function/method pca in your analysis file/class. The function should take in a list of column IDs and return a PCAData object with the set of source columns, eigenvalues, eigenvectors, and projected data within it.

    If you run this test file on this data file, then you should get this result. Note that you will need to make some modifications to the test code if you have used different methods/fields in your Data class than those used by Stephanie.
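
One possible shape for the two tasks above is sketched below. The Data class here is a tiny hypothetical stand-in, not the class from project 2, and the field names (headers, matrix_data, data_headers, pca_headers, evals, evecs, means), the constructor arguments, the "PCA00"-style header names, and the pca(data, headers) signature are all assumptions made for illustration. Only the five get_* method names come from the list in task 1. Adapt everything else to your own Data class.

```python
import numpy as np

class Data:
    """Hypothetical stand-in for the course Data class (numeric columns only)."""
    def __init__(self, headers, matrix):
        self.headers = list(headers)
        self.matrix_data = np.matrix(matrix, dtype=float)

    def get_data(self, headers):
        # Return the requested columns as a numpy matrix.
        cols = [self.headers.index(h) for h in headers]
        return self.matrix_data[:, cols]

class PCAData(Data):
    """Holds original and projected columns together, plus the PCA results."""
    def __init__(self, data_headers, original, projected, evals, evecs, means):
        pca_headers = ["PCA%02d" % i for i in range(projected.shape[1])]
        # Populate the parent fields with original + projected columns.
        Data.__init__(self, list(data_headers) + pca_headers,
                      np.hstack([original, projected]))
        self.data_headers = list(data_headers)
        self.pca_headers = pca_headers
        self.evals = np.matrix(evals, dtype=float).reshape(1, -1)   # single row
        self.evecs = np.matrix(evecs, dtype=float)                  # one per column
        self.means = np.matrix(means, dtype=float).reshape(1, -1)   # single row

    def get_eigenvalues(self):
        return self.evals.copy()

    def get_eigenvectors(self):
        return self.evecs.copy()

    def get_data_mean(self):
        return self.means.copy()

    def get_pca_headers(self):
        return list(self.pca_headers)

    def get_data_headers(self):
        return list(self.data_headers)

def pca(data, headers):
    """Sketch of a pca function using the covariance eigendecomposition."""
    A = np.asarray(data.get_data(headers))      # N x d plain array
    means = A.mean(axis=0)
    D = A - means                               # centered data
    cov = D.T @ D / (A.shape[0] - 1)            # sample covariance
    evals, evecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(evals)[::-1]             # largest eigenvalue first
    evals, evecs = evals[order], evecs[:, order]
    return PCAData(headers, A, D @ evecs, evals, evecs, means)

d = Data(["a", "b", "c"], [[1, 2, 0], [3, 4, 1], [5, 6, 0]])
result = pca(d, ["a", "b"])
print(result.get_pca_headers(), result.get_eigenvalues())
```

The provided test file may expect different field names or constructor arguments, so treat this only as a starting outline.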


When you are done with the lab exercises, you may start on the rest of the project.