Principal Components Analysis
The goal of this week's lab is to add the capability to execute PCA on a data set and then create plots based on the analysis.
The first part of the project involves extending your Data and Analysis classes/files. The second part of the project involves integrating the analysis into the GUI.
Read through all of the tasks (lab and project) and plan your design before you start writing code.
Executing a PCA analysis creates three things. First, it generates the eigenvectors, which specify a new basis, or set of axes, for the data. Second, it generates the eigenvalues, which indicate how important each eigenvector is to representing the data. Third, projecting the data from its original data space into the PCA space generates a set of transformed data.
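As a rough illustration of those three outputs, here is a minimal NumPy sketch on a small made-up data set. The data values and variable names are assumptions chosen for the example, and it uses plain arrays rather than your Data class.

```python
import numpy as np

# Toy data: 5 points, 2 columns (values chosen only for illustration).
A = np.array([[2.0, 1.0],
              [3.0, 2.0],
              [4.0, 3.5],
              [5.0, 4.0],
              [6.0, 5.5]])

m = A.mean(axis=0)             # column means of the original data
D = A - m                      # mean-centered data
C = np.cov(A, rowvar=False)    # 2x2 covariance matrix of the columns

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder everything to put the largest eigenvalue first.
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
evals = evals[order]           # 2) eigenvalues: importance of each axis
evecs = evecs[:, order].T      # 1) eigenvectors as rows: the new axes

projected = D @ evecs.T        # 3) the data expressed in the PCA space

# Fraction of the total variance captured by each new axis.
print(evals / evals.sum())
```

The variance of each column of projected equals the matching eigenvalue, which is one way to read "how important each eigenvector is."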
Because the transformed data is, for all intents and purposes, a new data set, it makes sense to use a Data object to hold it. However, we need to extend that Data object to include fields for the mean values, eigenvectors and eigenvalues, as well as information such as the set of columns used for the analysis. Therefore, it makes sense to extend the Data class and create a child class, PCAData class, that will hold the results of a PCA analysis.
Implement a new class PCAData that inherits from the Data class. You can put this new class into your Data.py file or create a new file for it. The class should have fields to hold the eigenvalues (numpy matrix), the eigenvectors (numpy matrix), the mean data values (numpy matrix), and the projected data (numpy matrix). It should also have a field to hold the headers of the original data columns used to create the projected data.
The constructor for the PCAData class should be able to take in the projected data, the eigenvectors, the eigenvalues, the original data means, and the original data headers. You will use this capability in the analysis.pca method for step 2.
In the constructor you will need to make sure that all of the fields of the Data class get populated when you create the PCAData class, including the field that holds the raw data. The raw data should be a list/string version of the projected data.
In order for the test file to work, your PCAData class has to support the following methods. You can make whatever other methods you feel will be useful.
- get_eigenvalues() - returns a copy of the eigenvalues as a single-row numpy matrix.
- get_eigenvectors() - returns a copy of the eigenvectors as a numpy matrix with the eigenvectors as rows.
- get_data_means() - returns the means for each column in the original data as a single-row numpy matrix.
- get_data_headers() - returns a copy of the list of the headers from the original data used to generate the projected data.
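One possible shape for the class is sketched below. The Data class here is a hypothetical stand-in with made-up field names (headers, raw_data, matrix_data), the "P00", "P01" column names are an assumption, and the constructor argument order simply follows the description above. Adapt all of these to your own Data class and to whatever the test file expects.

```python
import numpy as np

class Data:
    """Hypothetical stand-in for your own Data class; your field names may differ."""
    def __init__(self):
        self.headers = []                                # column names
        self.raw_data = []                               # rows of strings
        self.matrix_data = np.matrix(np.zeros((0, 0)))   # numeric data

class PCAData(Data):
    def __init__(self, pdata, evecs, evals, means, headers):
        Data.__init__(self)
        # PCA-specific fields
        self.eigenvalues = np.matrix(evals)        # 1 x k
        self.eigenvectors = np.matrix(evecs)       # k x k, eigenvectors as rows
        self.mean_data_values = np.matrix(means)   # 1 x k
        self.original_headers = list(headers)
        # Parent-class fields, filled in from the projected data
        self.matrix_data = np.matrix(pdata)
        self.headers = ["P%02d" % i for i in range(self.matrix_data.shape[1])]
        self.raw_data = [[str(v) for v in row] for row in self.matrix_data.tolist()]

    def get_eigenvalues(self):
        return self.eigenvalues.copy()

    def get_eigenvectors(self):
        return self.eigenvectors.copy()

    def get_data_means(self):
        return self.mean_data_values.copy()

    def get_data_headers(self):
        return list(self.original_headers)

# Small usage example with made-up values.
result = PCAData(np.array([[1.0, 2.0], [3.0, 4.0]]), np.eye(2),
                 [2.0, 0.5], [0.0, 1.0], ["a", "b"])
print(result.get_eigenvalues().shape)
print(result.get_data_headers())
```

Returning copies from the accessors means a caller can modify what it gets back without changing the stored analysis results.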
Implement a function pca in your Analysis class/file. The function should take in a Data object and a list of column headers and return a PCAData object with the source column headers, projected data, eigenvalues, eigenvectors, and source data means within it.
Your pca function should also have an optional argument that lets the user choose whether to pre-normalize the data before executing the PCA analysis. By default, that argument should be True.
When the data being used is homogeneous (it all exists in the same units with the same semantic meaning), normalization is not the correct action. However, when we have heterogeneous data that represents different units with different semantic meanings, normalization avoids letting the arbitrary unit designations dominate the PCA analysis.
If the normalization argument is True, then use the normalize_columns_separately function to access the source data. Otherwise, use get_data.
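To see why the default is to normalize, the sketch below compares the first principal axis with and without per-column scaling. The minmax_columns function is a hypothetical stand-in that assumes normalize_columns_separately scales each column to the range [0, 1]. The data values are made up.

```python
import numpy as np

def minmax_columns(A):
    """Hypothetical stand-in for normalize_columns_separately:
    scale each column to [0, 1] (assumes no column is constant)."""
    lo = A.min(axis=0)
    hi = A.max(axis=0)
    return (A - lo) / (hi - lo)

def first_axis(A):
    """Return the unit eigenvector of cov(A) with the largest eigenvalue."""
    evals, evecs = np.linalg.eigh(np.cov(A, rowvar=False))
    return evecs[:, np.argmax(evals)]

# Illustrative values: column 0 in grams, column 1 in meters.
A = np.array([[52000.0, 1.60],
              [61000.0, 1.72],
              [70000.0, 1.75],
              [83000.0, 1.81],
              [95000.0, 1.90]])

raw_axis = first_axis(A)                    # dominated by the gram column
norm_axis = first_axis(minmax_columns(A))   # both columns contribute

print(np.abs(raw_axis))
print(np.abs(norm_axis))
```

Without normalization the first axis points almost entirely along the column with the larger numbers, which reflects the choice of units rather than any structure in the data.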
You can use one of two methods to calculate the eigenvectors and eigenvalues: singular value decomposition of the mean-subtracted data matrix, or direct eigenvalue and eigenvector calculation using the covariance matrix of the data. Both methods produce the same eigenvalues, and the same eigenvectors up to a possible sign flip.
The following are the two algorithms. Choose one.
# This version uses SVD
def pca(d, headers, normalize=True):
    # assign to A the desired data. Use either normalize_columns_separately
    # or get_data, depending on the value of the normalize argument.

    # assign to m the mean values of the columns of A

    # assign to D the difference matrix A - m

    # assign to U, S, V the result of running np.linalg.svd on D, with full_matrices=False

    # the eigenvalues of cov(A) are the squares of the singular values (S matrix)
    # divided by the degrees of freedom (N-1). The values are sorted.

    # project the data onto the eigenvectors. Treat V as a transformation
    # matrix and right-multiply it by D transpose, then transpose the result
    # so each row is a data point. The eigenvectors of cov(A) are the rows
    # of V. The eigenvectors match the order of the eigenvalues.

    # create and return a PCA data object with the headers, projected data,
    # eigenvectors, eigenvalues, and mean vector.
# This version calculates the eigenvectors of the covariance matrix
def pca(d, headers, normalize=True):
    # assign to A the desired data. Use either normalize_columns_separately
    # or get_data, depending on the value of the normalize argument.

    # assign to C the covariance matrix of A, using np.cov with rowvar=False

    # assign to W, V the result of calling np.linalg.eig on C

    # sort the eigenvectors V and eigenvalues W to be in descending order. At
    # the end of this process, the eigenvectors should be a matrix V with
    # each eigenvector as a row of the matrix.

    # assign to m the mean values of the columns of A

    # assign to D the difference matrix A - m

    # project the data onto the eigenvectors. Treat V as a transformation
    # matrix and right-multiply it by D transpose, then transpose the result
    # so each row is a data point.

    # create and return a PCA data object with the headers, projected data,
    # eigenvectors, eigenvalues, and mean vector.
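If you want to convince yourself that the two outlines agree, the following sketch runs both on random data and compares the results. It is a check under stated assumptions, not a solution to the task: it uses plain NumPy arrays instead of your Data object, made-up random data, and np.linalg.eigh (the symmetric-matrix variant of np.linalg.eig) so that the eigenvalues come back real.

```python
import numpy as np

# Illustrative random data: 50 points, 3 columns with clearly different spreads.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.2])
N = A.shape[0]
m = A.mean(axis=0)
D = A - m

# Route 1: SVD of the mean-subtracted data.
# np.linalg.svd returns the right singular vectors already as rows.
U, S, Vt = np.linalg.svd(D, full_matrices=False)
evals_svd = S**2 / (N - 1)
evecs_svd = Vt

# Route 2: eigen-decomposition of the covariance matrix.
W, V = np.linalg.eigh(np.cov(A, rowvar=False))
order = np.argsort(W)[::-1]       # descending eigenvalue order
evals_cov = W[order]
evecs_cov = V[:, order].T         # eigenvectors as rows

# Eigenvectors are only defined up to sign, so find the sign that aligns each pair.
signs = np.sign(np.sum(evecs_svd * evecs_cov, axis=1))

# Project: apply the eigenvector matrix to D transpose, then transpose back.
proj_svd = (evecs_svd @ D.T).T
proj_cov = (evecs_cov @ D.T).T

print(np.allclose(evals_svd, evals_cov))
print(np.allclose(evecs_svd, signs[:, None] * evecs_cov))
print(np.allclose(proj_svd, proj_cov * signs))
```

Because of the sign ambiguity, your projected data may differ from a reference result by a factor of -1 in some columns even when your code is correct.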
- Test your PCA code. If you run this test file on this data file, then you should get this result.
When you are done with the lab tasks, get started on the rest of the project.