Clustering
Due Monday 16 April 2018
The goal of this week's project is to add basic K-means clustering analysis to your program. You want to be able to execute the following workflow.
- Open a data file
- Select "Cluster Data" (a button or menu item)
- In a dialog, select a set of features to use for clustering and the number of clusters K
- Execute the clustering and save the results
- Plot the data with colors specified by the cluster ID
- View the cluster means
There are a variety of options for implementing the last two steps of the workflow.
- Create a ClusterData class, similar to the PCAData class, that holds all of the cluster meta-information (cluster means and number of clusters K), the cluster IDs, and possibly the original data set. Let the user select a cluster analysis and create plots with the color of the plot determined by the cluster IDs.
- Add the cluster IDs as a new column in the original data used for clustering. Store the meta-information (number of clusters and cluster means) separately in a data structure (possibly a ClusterData class with the meta information and the cluster IDs as the data).
- Write out the clustering meta-information to a CSV file and write out the source data file plus the cluster ID column to a second CSV file. The user can read in the second file and plot it using your standard plot functionality with the cluster ID column as color.
Plan out your design before you start. Think about what fields you need to add to your application class, plan out the look of your GUI and the elements you need to add, and plan how you are going to store the cluster means, number of clusters, and the codes. Then start the tasks below.
Tasks
- Add a write function to your Data class, enabling you to write out a selected set of headers to a specified file. The function should take in a filename and an optional list of the headers of the columns to write to the file. If you have not already done so, you should probably also give your Data object the ability to add a column of data. This functionality is extremely important if you want to save your results and be able to use them, for example, in a paper.
- Add the capability to execute a clustering on the currently open data file. You will need to get from the user the set of data headers to use in the clustering and the number of clusters to create.
- Implement a method of storing the cluster results as part of enabling the user to generate a plot using the cluster ID as the color. As noted above, some options include:
- Write the cluster information (means and K) to a CSV file and write the source data set and the cluster IDs to a second CSV file.
- Add the cluster IDs to the existing data set in memory and store the cluster information (means and K) separately.
- Use a ClusterData object to store everything and treat it like a PCA analysis.
Whatever method you choose, the user should be able to view the cluster means. Writing them to a file is ok. Displaying them in a dialog, similar to the eigenvectors in project 6, is also ok.
- Enable the user to plot the data using cluster ID as the basis for the color.
To visualize clusters using color, you want each cluster to have a unique color. Ideally, you want to have a pre-selected set of easily differentiated colors from which to choose, rather than picking random colors. In practice, it's often useful to have ten easily differentiated colors and then pick randomly after that if you have more than ten clusters.
Depending upon your implementation, you may want to give the user a checkbox when picking what column to use for coloring the plot that lets the user choose between a smooth color palette or a discriminative color palette.
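The "ten fixed colors, then random" idea can be sketched as follows. The specific hex values and the function names cluster_colors and colors_for_ids are illustrative assumptions, not part of the assignment; use whatever color format your GUI toolkit expects.

```python
import random

# Ten hand-picked, easily differentiated colors as hex strings.
# These particular values are an illustrative choice only.
DISTINCT_COLORS = [
    "#e6194b", "#3cb44b", "#4363d8", "#f58231", "#911eb4",
    "#46f0f0", "#f032e6", "#bcf60c", "#008080", "#9a6324",
]

def cluster_colors(k, seed=0):
    """Return a list of k distinct hex colors, one per cluster ID."""
    colors = DISTINCT_COLORS[:k]
    rng = random.Random(seed)  # seeded so the extra colors are repeatable
    while len(colors) < k:
        # Beyond ten clusters, fall back to random colors.
        candidate = "#%02x%02x%02x" % tuple(rng.randrange(256) for _ in range(3))
        if candidate not in colors:
            colors.append(candidate)
    return colors

def colors_for_ids(ids, k):
    """Map each point's cluster ID (0..k-1) to its cluster's color."""
    palette = cluster_colors(k)
    return [palette[int(i)] for i in ids]
```

Because the palette is indexed by cluster ID, the same cluster keeps the same color every time the plot is redrawn.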
- Compute a K-means quality estimate using the equation below. This K-means quality statistic is called the Description Length and was defined by Jorma Rissanen. It balances the error of the fit with the complexity of the model. The clustering with the smallest Description Length is generally a good choice for K.
Write a function kmeans_quality in your analysis module that computes the Description Length. The function will need the matrix of errors returned by your kmeans algorithm and the number of clusters K. Compute the sum squared error and use the numpy log2 function to calculate the statistic with the following equation. N is the number of data points. Provide this statistic to the user with each clustering.
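The equation itself does not appear in this text, so the sketch below rests on an assumption: it takes the Description Length to be SSE + (K/2) * log2(N), and it takes each entry of the error matrix to be the distance from a point to its closest cluster mean. Check both against the assignment's equation and your own kmeans output before relying on it.

```python
import numpy as np

def kmeans_quality(errors, K):
    """Description Length, ASSUMING the form SSE + (K/2) * log2(N).

    errors: N x 1 array-like of distances from each point to its closest mean
            (an assumption about what your kmeans returns).
    K:      number of clusters.
    """
    errors = np.asarray(errors, dtype=float)
    N = errors.shape[0]                     # number of data points
    sse = float(np.sum(np.square(errors)))  # sum squared error
    return sse + (K / 2.0) * float(np.log2(N))
```

If your kmeans already returns squared distances, drop the np.square call.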
- Using this data set, compute the clustering results for K from 2 through 8. Because the starting clusters are randomized, you may want to run the kmeans algorithm several times and take the best result (lowest description length). Record the Description Length for each clustering and include a plot of them in your report. This data set will work best if you do not whiten the data first, but it works ok either way.
Include the cluster plots for K = 3, 4, and 5 in your report.
What is the best number of clusters, given the Description Length statistics (minimum Description Length)? Do the results make sense? Are there clear clusters in this data? Does your K-means algorithm find the apparent clusters? What are the means of each cluster?
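The sweep over K with several random restarts can be sketched as follows. Everything named here is hypothetical: simple_kmeans is a bare-bones stand-in for your own kmeans function, description_length assumes the form SSE + (K/2) * log2(N), and best_of_restarts and sweep_k are illustrative helper names.

```python
import numpy as np

def simple_kmeans(X, K, rng, iters=100):
    """Stand-in K-means (Lloyd's algorithm); returns means, ids, errors."""
    # Start from K distinct, randomly chosen data points.
    means = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    for _ in range(iters):
        # Distance from every point to every mean, shape (N, K).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        ids = np.argmin(dists, axis=1)
        new_means = means.copy()
        for k in range(K):
            members = X[ids == k]
            if len(members) > 0:  # keep the old mean if a cluster empties
                new_means[k] = members.mean(axis=0)
        if np.allclose(new_means, means):
            break
        means = new_means
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    ids = np.argmin(dists, axis=1)
    errors = dists[np.arange(X.shape[0]), ids]  # distance to closest mean
    return means, ids, errors

def description_length(errors, K):
    # ASSUMED form: SSE + (K/2) * log2(N)
    return float(np.sum(errors ** 2) + (K / 2.0) * np.log2(errors.shape[0]))

def best_of_restarts(X, K, restarts=10, seed=0):
    """Run K-means several times; keep the lowest Description Length."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):
        means, ids, errors = simple_kmeans(X, K, rng)
        dl = description_length(errors, K)
        if best is None or dl < best[0]:
            best = (dl, means, ids)
    return best

def sweep_k(X, k_values=range(2, 9), restarts=10, seed=0):
    """Return {K: best Description Length} for K from 2 through 8."""
    return {K: best_of_restarts(X, K, restarts, seed)[0] for K in k_values}
```

The dictionary returned by sweep_k holds one Description Length per K, which is the series to plot in the report.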
- Cluster the Australia Coast data set into 10 clusters using the same dimensions used for the PCA analysis last week and visualize the result. Include a picture of this in your report.
- Using a data set of your choice, select a set of features and execute a clustering. Choose your features intentionally after looking at a plot of those features. How many natural clusters appear to exist in the data set when you view it?
Include a picture of your visualization in your report. Does the result make sense given your initial understanding of the data? Do the clusters tell you anything about your data? What does the Description Length statistic tell you about the appropriate number of clusters?
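Returning to the first task, the write function and the add-a-column capability might look like the sketch below. The Data class shown is a tiny hypothetical stand-in (a list of headers plus a numeric numpy matrix), not the course's actual class, and the method names add_column and write are assumptions; adapt the idea to the fields your own Data class already has.

```python
import csv
import numpy as np

class Data:
    """Hypothetical stand-in for the Data class: numeric columns only."""

    def __init__(self, headers, matrix):
        self.headers = list(headers)
        self.matrix = np.asarray(matrix, dtype=float)

    def add_column(self, header, values):
        """Append one column, e.g. the cluster IDs from a clustering."""
        values = np.asarray(values, dtype=float).reshape(-1, 1)
        if values.shape[0] != self.matrix.shape[0]:
            raise ValueError("column length must match the number of rows")
        self.headers.append(header)
        self.matrix = np.hstack((self.matrix, values))

    def write(self, filename, headers=None):
        """Write the selected columns (all columns if headers is None)."""
        headers = self.headers if headers is None else list(headers)
        cols = [self.headers.index(h) for h in headers]
        with open(filename, "w", newline="") as fp:
            writer = csv.writer(fp)
            writer.writerow(headers)
            writer.writerows(self.matrix[:, cols].tolist())
```

If your data files carry extra header rows, such as a row of column types, write those rows as well so that your program can read the file back in.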
Extensions
- Modify your kmeans_init so it can take in a set of initial means. Using the cluster4.csv test file, show the results of using different initial means on the kmeans result.
- Using your analysis and data functions, and separate from your GUI, project a data set into its PCA space, keep only enough dimensions to represent 90% of the variation in the data, then cluster it and write the projected data and cluster IDs to a file. Plot the clusters on the first three eigenvectors (if you have that many). Clustering after PCA and dimension reduction is a common practice in data mining, as it reduces the effect of correlated variables and noise on the clusters. If you do this extension, compare clustering in the original space with clustering in the reduced PCA space.
The Australia Coast data set works fairly well for this if you use just the first three eigenvectors.
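The "keep enough dimensions for 90% of the variation" step can be sketched as follows. This is a minimal numpy version under stated assumptions: the data is centered but not otherwise normalized, and project_to_pca is a hypothetical name rather than a function from your analysis module.

```python
import numpy as np

def project_to_pca(X, variance_kept=0.90):
    """Project X onto the fewest eigenvectors covering variance_kept."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the eigenvectors of the covariance matrix,
    # ordered from largest to smallest eigenvalue.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = S ** 2 / (X.shape[0] - 1)
    cumulative = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    # First index whose cumulative fraction reaches the target, plus one.
    n_keep = int(np.searchsorted(cumulative, variance_kept)) + 1
    n_keep = min(n_keep, len(eigenvalues))
    projected = centered @ Vt[:n_keep].T
    return projected, n_keep, cumulative
```

The projected matrix is what you would hand to your kmeans function; the cluster IDs it returns can then be written out alongside the projected columns.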
- Enable the above extension within your GUI. This is a significant extension, if done well.
- Add different distance metrics to your clustering algorithm. Show comparisons of the differences in the results.
- Add features to the clustering workflow, such as letting the user select the distance metric or other parameters of the clustering algorithms.
- Implement other clustering methods.
- Enable the user to view the cluster means in their plots. Give the cluster means names.
- There are many ways of calculating quality statistics for clustering. See this page for some of the options. Compute more quality statistics and show how they compare for the cluster4 data set or others of your choice.
Report
Make a wiki page for the project report.
- Write a brief summary, separate from the body of your report, of your project that describes the purpose, the task, and your solution to it. It should describe the task, the key parts of your solution, and the result of your work (did it work, what can you do with your GUI?). The summary should be 200 words or less.
- Write a brief explanation of how to run a clustering using your GUI. Include any extensions or enhancements you implemented.
- Incorporate the required plots in your report, including the cluster4 and Australia Coast results. Make sure each figure is clearly referenced and explained in the text.
- Include the clustering results and plot for your own data set.
Discuss the results and answer the questions from task 7.
These are some guidelines to consider when writing this section of your report.
- Did this analysis help you learn anything from your data set?
- If so, what? (i.e. summarize your results) If not, why not? (i.e. explain why this method was not appropriate for your data)
- Are all of your scatter plot results properly labeled? Is it clear which features are plotted along each axis? It is OK if the information is in the text instead of on the image.
- Are all numeric results properly labeled? Is it clear what each number represents and what its unit is? Is the label itself clearly explained?
- Are numeric results presented in a concise, easy-to-read manner? For example, if there are more than 3 related numbers, a table or graph might be more appropriate than having the numbers directly in the text.
- Be sure to document and describe any extensions. Include pictures as appropriate.
- Acknowledgements: a list of people you worked with, including TAs, and instructors. Include in that list anyone whose code you may have seen, such as those of friends who have taken the course in a previous semester.
Handin
Once you have written up your assignment, give the page the label:
cs251s18project7
Put your code in your private handin directory on Courses. Please make sure you are organizing your code by project in the Private subdirectory.