CS 251: Assignment #7

### Data Analysis: PCA and Clustering

Due Thursday, 14 April 2011

1. Implement PCA for the numeric components of a data set in your DataSet class. You should add three functions to your DataSet class.
• pca: the function should return all of the eigenvectors and eigenvalues of the numeric data, preferably sorted by decreasing eigenvalue. A good way to structure the return value is as a list of pairs (2-tuples), where each pair consists of an eigenvalue and its corresponding eigenvector.

Before executing the eigenvector analysis, normalize each individual variable either by subtracting its mean and dividing by the standard deviation or by subtracting the minimum value and dividing by the range (max - min).

Use the numpy.cov function to calculate the covariance matrix, and use the numpy.linalg.eig function to calculate the eigenvalues and eigenvectors.

• project: given a list of vectors (e.g. a subset of the eigenvectors from above), or a numpy matrix in which each row is a vector (e.g. an eigenvector), the function should return a matrix that is the projection of the data set onto that set of basis vectors. To project a data point, subtract the mean of the data and then take the dot product with each of the N basis vectors. That produces N numbers, which are the coordinates of the projected data point in the eigenspace.

• buildPCA: the function should combine the two steps above and return a new DataSet object that holds the projected data and has new header values (such as EV1, EV2, EV3, and so on).
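The normalization, eigenvector analysis, and projection steps described above can be sketched roughly as follows. This is only an illustration under some assumptions: the data is a plain numpy array with one row per data point and one column per variable, the functions are standalone rather than DataSet methods, and the `normalize` helper and its `method` argument are hypothetical names.

```python
import numpy as np

def normalize(data, method="range"):
    """Normalize each column of an (n_points, n_vars) array (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    if method == "range":
        span = data.max(axis=0) - data.min(axis=0)
        span[span == 0] = 1.0              # guard against constant columns
        return (data - data.min(axis=0)) / span
    std = data.std(axis=0)
    std[std == 0] = 1.0
    return (data - data.mean(axis=0)) / std

def pca(data):
    """Return (eigenvalue, eigenvector) pairs sorted by decreasing eigenvalue."""
    # numpy.cov treats rows as variables by default, so pass rowvar=False
    cov = np.cov(data, rowvar=False)
    # numpy.linalg.eig returns the eigenvectors as the COLUMNS of evecs
    evals, evecs = np.linalg.eig(cov)
    # a covariance matrix is symmetric, so any imaginary part is round-off
    evals, evecs = evals.real, evecs.real
    order = np.argsort(evals)[::-1]
    return [(evals[i], evecs[:, i]) for i in order]

def project(data, basis):
    """Project the rows of data onto the rows of basis after subtracting the mean."""
    basis = np.atleast_2d(np.asarray(basis, dtype=float))
    centered = data - data.mean(axis=0)
    return np.dot(centered, basis.T)
```

A call such as `project(normed, [vec for _, vec in pca(normed)[:3]])` would then give the coordinates of every data point along the three most important eigenvectors.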

Test out your functions on testdata1.csv and on the Australia Coast data set. Run PCA on the Australia Coast data set, then try projecting the data set onto the three most important eigenvectors and then visualize it with your program.

For comparison, here are the eigenvectors, eigenvalues, and projected data for testdata1.csv using both unnormalized data and normalized data, using the max/min normalization method. Note that the eigenvalues, and therefore the columns of the projected data, are ordered by decreasing eigenvalue.

2. Implement K-means clustering for the numeric components of a data set. This should also be a function in your DataSet class. It should take as arguments the columns to use for clustering and the number of clusters. Note that there is a numpy function that will do labeling based on Euclidean distance for an entire matrix of data points.

Use standard Euclidean distance for your base implementation, but plan your algorithm so that the distance calculation is a separate function.

You can use this simple data set to debug and evaluate your k-means algorithm.
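One possible shape for the base algorithm, with the distance calculation kept as a separate function, is sketched below. It assumes the selected columns have already been extracted into a numpy array; the function names, the `max_iter` and `seed` parameters, and the choice of random data points as initial means are all assumptions made for the example.

```python
import numpy as np

def euclidean(points, means):
    """Distance from every point to every mean, as an (n_points, k) array."""
    diff = points[:, np.newaxis, :] - means[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def kmeans(data, k, distance=euclidean, max_iter=100, seed=None):
    """Return (means, labels) for k clusters; distance is pluggable."""
    data = np.asarray(data, dtype=float)
    rng = np.random.RandomState(seed)
    # assumption: start from k distinct data points chosen at random
    means = data[rng.choice(len(data), k, replace=False)]
    for _ in range(max_iter):
        labels = distance(data, means).argmin(axis=1)
        new_means = means.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:           # leave an empty cluster's mean alone
                new_means[j] = members.mean(axis=0)
        if np.allclose(new_means, means):  # means stopped moving
            break
        means = new_means
    # label against the final means so the two return values agree
    labels = distance(data, means).argmin(axis=1)
    return means, labels
```

Because `distance` is an argument, a different metric can be swapped in later without touching the main loop, for example `kmeans(data[:, [0, 2]], 3, distance=my_metric)` where `my_metric` is any function with the same signature.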

3. Tie the PCA and clustering together so that you can execute a PCA analysis, optionally select a subset of the eigenvectors for the projection (like 3 or fewer), project the data, and then cluster the projected data.

Demonstrate this on the Australia Coast data. Run PCA on all of the variables. Keep the three most important eigenvectors, and project the data into that 3-D eigenspace. Run k-means clustering with K = 10, and then color the visualization by cluster ID.

Note that you do not have to have it all integrated into your visualization system, as you can do the analysis separately and then have your visualization program read in the projected data along with the cluster IDs.
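As a rough illustration of the separate-analysis route, the sketch below normalizes the data, projects it onto the top eigenvectors, clusters the projection, and returns a table with the cluster ID as an extra column that a visualization program could read. The function names, the `ClusterID` header, and the fixed iteration count are assumptions for the example, and max/min normalization is used only because it is one of the two options listed in task 1.

```python
import numpy as np

def top_eigenvectors(data, n):
    """Rows of the result are the n eigenvectors with the largest eigenvalues."""
    evals, evecs = np.linalg.eig(np.cov(data, rowvar=False))
    order = np.argsort(evals.real)[::-1][:n]
    return evecs.real[:, order].T

def assign(points, means):
    """Index of the closest mean (Euclidean distance) for every point."""
    diff = points[:, np.newaxis, :] - means[np.newaxis, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)

def cluster(points, k, iterations=50, seed=0):
    """Plain k-means with a fixed number of iterations; returns the labels."""
    rng = np.random.RandomState(seed)
    means = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        labels = assign(points, means)
        for j in range(k):
            if np.any(labels == j):
                means[j] = points[labels == j].mean(axis=0)
    return assign(points, means)

def pca_then_cluster(data, n_vectors=3, k=10):
    """Return (headers, table): projected coordinates plus a cluster ID column."""
    data = np.asarray(data, dtype=float)
    span = data.max(axis=0) - data.min(axis=0)
    span[span == 0] = 1.0                  # guard against constant columns
    normed = (data - data.min(axis=0)) / span
    basis = top_eigenvectors(normed, n_vectors)
    projected = np.dot(normed - normed.mean(axis=0), basis.T)
    labels = cluster(projected, k)
    headers = ["EV%d" % (i + 1) for i in range(n_vectors)] + ["ClusterID"]
    return headers, np.column_stack([projected, labels])
```

The returned headers and table could be written out as a CSV file and loaded by the visualization program, which would then color each point by the last column.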

4. Work through the WEKA tutorial with some of the sample data sets provided. You can download and install WEKA on your own computer from the WEKA site. At the end of your writeup, give a brief summary of your comparative results using two different classifiers on one of the sample data sets.

### Extensions

• Integrate the PCA and clustering into your visualization program.
• After running PCA, show the top 3 eigenvectors in your interactive visualization of the data set. This is effectively a second set of axes in your visualization, but instead of being [1, 0, 0], [0, 1, 0], and [0, 0, 1], they will be the eigenvectors.
• Have your clustering algorithm repeat the K-means clustering some number of times and keep the best result (the one with the smallest representation error, which is the sum squared error between each data point and its closest cluster mean).
• Try clustering the mean/stdev values for the bird arrival data set for task 4.1 last week.
• Try clustering (with or without doing PCA first) on other data sets, like the one you collected in week 1. Does it tell you anything useful?
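For the repeated-clustering extension, the representation error and the keep-the-best loop might look something like the sketch below. The function names and the `repeats` and `seed` parameters are assumptions, and `kmeans_once` is a deliberately compact stand-in for whatever K-means function you wrote in task 2.

```python
import numpy as np

def closest(data, means):
    """Index of the closest mean (squared Euclidean distance) for every point."""
    diff = data[:, np.newaxis, :] - means[np.newaxis, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)

def kmeans_once(data, k, rng, iterations=50):
    """One K-means run from random starting points; returns (means, labels)."""
    means = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iterations):
        labels = closest(data, means)
        for j in range(k):
            if np.any(labels == j):
                means[j] = data[labels == j].mean(axis=0)
    return means, closest(data, means)

def representation_error(data, means, labels):
    """Sum squared error between each point and its closest cluster mean."""
    return float(((data - means[labels]) ** 2).sum())

def best_kmeans(data, k, repeats=10, seed=0):
    """Run K-means several times and keep the run with the smallest error."""
    data = np.asarray(data, dtype=float)
    rng = np.random.RandomState(seed)
    best = None
    for _ in range(repeats):
        means, labels = kmeans_once(data, k, rng)
        err = representation_error(data, means, labels)
        if best is None or err < best[0]:
            best = (err, means, labels)
    return best
```

Sharing one random number generator across the repeats gives each run a different starting point while keeping the whole experiment reproducible from a single seed.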

### Writeup

For this writeup, summarize the PCA process and show the eigenvalues and eigenvectors for the test data set (numerical values). Also include in your writeup a picture of the visualization of the Australia Coast data set projected to a 3-dimensional space.

Discuss any special features of your K-means algorithm and describe any extensions you did.

### Handin

Once you have written up your assignment, give the page the label:

cs251s11project7

Put your code in the COMP/CS251 folder on fileserver1/Academics. Please make sure you are organizing your code by project.