Principal Components Analysis
Due Monday 9 April 2018
The goal of this week's lab is to give your GUI the capability to execute PCA on a data set and then create plots based on the analysis.
Tasks
- In your GUI, enable the user to pick a data file, execute a PCA analysis, store the result, and then create an entry in a listbox that links to the result. The basic capability should allow the user to pick and choose which columns of the original data to use in the PCA analysis. Each new analysis should show up as a new entry in the analysis list box. Allow the user to delete an existing analysis.
- In your GUI, enable the user to select an analysis from the listbox and view the data projected onto the first three eigenvectors. An extension is to allow the user to pick the columns to plot. In general, you do not want to normalize the transformed PCA data separately before plotting, but doing so is fine.
- In your GUI, somehow enable the user to see the eigenvectors and eigenvalues of a selected PCA analysis. For example, show them in a dialog window as a table, as shown below.
You probably want to use the tkinter grid layout method in order to build the table.
Note the second and third columns, which show the eigenvalues and the cumulative percentage of the eigenvalues from largest to smallest. In this case, the first five eigenvectors explain 92% of the variation in the data set.
- Using the Australia Coast data set, compute the PCA analysis on the columns: premin, premax, salmin, salmax, minairtemp, maxairtemp, minsst, maxsst, minsoilmoist, maxsoilmoist, and runoffnew. Then show a spatial plot of the data projected onto the first three eigenvectors. The plot should look something like the following.
- Using a data set of your choice, execute a PCA analysis using at least three dimensions of data. If you wish, you can use the Iris data set, which is simple and has a nice PCA result. Do not use the iris "class" feature in the PCA analysis. In your report, discuss/include the following items.
- Is your data homogeneous or heterogeneous? (If heterogeneous, be sure to normalize the data prior to executing the PCA)
- How many significant dimensions exist in the features you chose for the PCA analysis? Another way of asking the question: how many eigenvectors are required to represent at least 90% of the data variation?
- Look at the first eigenvector. What dimensions are the primary contributors to it (have the largest coefficients)? Are those dimensions negatively or positively correlated? Does the result make sense?
- Show a plot of your transformed data using the first three eigenvectors.
- Come up with an acronym or name for your program. Be creative. The success of your program may, in the end, be completely determined by how cool your acronym is. Then again, its success may have something to do with the quality of your work. But it never hurts to have a cool name.
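As a rough illustration of the computation behind the tasks above, here is a minimal NumPy sketch of PCA, the cumulative eigenvalue fractions shown in the table, and the "at least 90% of the variation" count. The function names (pca, cumulative_fraction, dims_needed) and the choice of SVD are assumptions made for this example, not a required design; your own analysis code may be organized differently.

```python
import numpy as np


def pca(data, normalize=False):
    """Hypothetical helper: PCA of an (N rows x D columns) numeric array.

    Assumes N >= D and, when normalize is True, that no column is constant.
    Returns (means, eigenvalues, eigenvectors, projected). Eigenvectors are
    stored as rows, ordered from largest to smallest eigenvalue.
    """
    A = np.asarray(data, dtype=float)
    means = A.mean(axis=0)
    diff = A - means                          # center each column
    if normalize:
        # For heterogeneous data: scale each column to unit standard deviation.
        diff = diff / A.std(axis=0, ddof=1)
    # Rows of Vt are the eigenvectors of the covariance matrix of diff.
    _, S, Vt = np.linalg.svd(diff, full_matrices=False)
    eigenvalues = S ** 2 / (A.shape[0] - 1)   # variance along each eigenvector
    projected = diff @ Vt.T                   # data expressed in eigenvector space
    return means, eigenvalues, Vt, projected


def cumulative_fraction(eigenvalues):
    """Running fraction of total variation, from largest eigenvalue to smallest."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return np.cumsum(eigenvalues) / np.sum(eigenvalues)


def dims_needed(eigenvalues, threshold=0.9):
    """Smallest number of eigenvectors whose cumulative fraction reaches threshold."""
    cumulative = cumulative_fraction(eigenvalues)
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(cumulative))
```

The first three columns of projected are what you would hand to your plotting code, and the rows of the eigenvector matrix line up with the original column order, which makes it easy to see which dimensions contribute most to the first eigenvector.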
Extensions
- Enable reading and writing the PCA data of an analysis as a CSV file. You will need to somehow store the eigenvectors, eigenvalues, and column averages along with the projected data. Note that there should be as many eigenvectors, eigenvalues, and means as there are columns of data.
- Add other features, like the ability to name an analysis.
- Enable the user to select up to five columns from the PCA analysis to plot (x, y, z, color, size).
- Enable the user to select up to five columns, intermixed from the original data and the PCA analysis to plot. For example, try plotting the Australia Coast data using Latitude and Longitude for the x and y spatial axes, then using the projections onto the first two eigenvectors for color and size.
- Demonstrate your system on more data sets and discuss the results.
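For the CSV extension, one possible file layout is sketched below using only the standard csv module. The layout is an assumption chosen for illustration: each row starts with a tag ("mean", "eigenvalue", "eigenvector", or "projected") so that all four pieces of an analysis fit in one file. The function names write_pca_csv and read_pca_csv are hypothetical, and any format that round-trips the same information is equally valid.

```python
import csv


def write_pca_csv(path, headers, means, eigenvalues, eigenvectors, projected):
    """Write one PCA analysis to a CSV file, one tagged row per item."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["kind"] + list(headers))        # original column names
        writer.writerow(["mean"] + list(means))
        writer.writerow(["eigenvalue"] + list(eigenvalues))
        for vec in eigenvectors:                         # one row per eigenvector
            writer.writerow(["eigenvector"] + list(vec))
        for row in projected:                            # one row per data point
            writer.writerow(["projected"] + list(row))


def read_pca_csv(path):
    """Read a file written by write_pca_csv back into a dictionary of lists."""
    result = {"headers": [], "means": [], "eigenvalues": [],
              "eigenvectors": [], "projected": []}
    with open(path, newline="") as f:
        reader = csv.reader(f)
        result["headers"] = next(reader)[1:]             # drop the "kind" tag
        for row in reader:
            kind = row[0]
            values = [float(v) for v in row[1:]]
            if kind == "mean":
                result["means"] = values
            elif kind == "eigenvalue":
                result["eigenvalues"] = values
            elif kind == "eigenvector":
                result["eigenvectors"].append(values)
            elif kind == "projected":
                result["projected"].append(values)
    return result
```

Tagging each row keeps the file readable in a spreadsheet and lets the reader check that the number of eigenvectors, eigenvalues, and means matches the number of columns.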
Report
Make a wiki page for the project report.
- Write a brief summary of your project, separate from the body of your report, that describes the purpose, the task, and your solution to it. It should cover the key parts of your solution and the result of your work (did it work, and what can you do with your GUI?). The summary should be 200 words or fewer.
- Write a brief explanation of how to run a PCA analysis using your GUI. Include any extensions or enhancements you implemented.
- Incorporate screen shots showing a visualization of the Australia Coast data set into your report. Focus on integrating the text and figures.
- Include the analysis of your own data set using PCA. Be sure to answer each of the questions from task 5. Include a picture of the plot.
These are some guidelines to consider when writing this section of your report.
- Did this analysis help you learn anything from your data set?
- If so, what? (i.e. summarize your results) If not, why not? (i.e. explain why this method was not appropriate for your data)
- Are all of your scatter plot results properly labeled? Is it clear which features are plotted along each axis? It is OK if the information is in the text instead of on the image.
- Are all numeric results properly labeled? Is it clear what each number represents and what its unit is? Is the label itself clearly explained?
- Are numeric results presented in a concise, easy-to-read manner? For example, if there are more than 3 related numbers, a table or graph might be more appropriate than having the numbers directly in the text.
- Describe any extensions or enhancements you implemented. Include pictures as appropriate.
- Acknowledgements: a list of people you worked with, including TAs and instructors. Include in that list anyone whose code you may have seen, such as friends who have taken the course in a previous semester.
Handin
Once you have written up your assignment, give the page the label:
cs251s18project6
Put your code in your private handin directory on Courses. Please make sure you are organizing your code by project in the Private subdirectory.