Title image Spring 2019

Clustering

Due Wednesday 17 April 2019

The goal of this week's project is to add basic K-means clustering analysis to your program. You want to be able to execute the following workflow.

  1. Open a data file
  2. Select "Cluster Data" (a button or menu item)
  3. In a dialog, select a set of features to use for clustering and the number of clusters K
  4. Execute the clustering and save the results
  5. Plot the data with colors specified by the cluster ID
  6. View the cluster means

There are a variety of options for implementing the last two steps of the workflow.

Plan out your design before you start. Think about what fields you need to add to your application class, plan out the look of your GUI and the elements you need to add, and plan how you are going to store the cluster means, number of clusters, and the codes. Then start the tasks below.


Tasks

  1. Add a write function to your Data class, enabling you to write out a selected set of headers to a specified file. The function should take in a filename and an optional list of the headers of the columns to write to the file. If you have not already done so, you should probably also give your Data object the ability to add a column of data. This functionality is extremely important if you want to save your results and be able to use them, for example, in a paper.
  2. Add the capability to execute a clustering on the currently open data file. You will need to get from the user the set of data headers to use in the clustering and the number of clusters to create.
  3. Implement a method of storing the cluster results as part of enabling the user to generate a plot using the cluster ID as the color. As noted above, some options include:
    • Write the cluster information (means and K) to a CSV file and write the source data set and the cluster IDs to a second CSV file.
    • Add the cluster IDs to the existing data set in memory and store the cluster information (means and K) separately.
    • Use a ClusterData object to store everything and treat it like a PCA analysis.

    Whatever method you choose, the user should be able to view the cluster means. Writing them to a file is ok. Displaying them in a dialog, similar to the eigenvectors in project 6 is also ok.

  4. Enable the user to plot the data using cluster ID as the basis for the color.

    To visualize clusters using color, you want each cluster have a unique color. Ideally, you want to have a pre-selected set of easily differentiated colors from which to choose, rather than picking random colors. In practice, it's often useful to have ten easily differentiated colors and then pick randomly after that if you have more than ten clusters.

    Depending upon your implementation, you may want to give the user a checkbox when picking what column to use for coloring the plot that lets the user choose between a smooth color palette or a discriminative color palette.

  5. Compute a K-means quality estimate using the equation below. This K-means quality statistic is called the Description Length and was defined by Jorma Rissanen. It balances the error of the fit with the complexity of the model. The clustering with the smallest Description Length is generally a good choice for K.

    Write a function kmeans_quality in your analysis module that computes the Description Length. The function will need the matrix of errors returned by your kmeans algorithm and the number of clusters K. Compute the sum squared error and use the numpy log2 function to calculate the statistic with the following equation. N is the number of data points. Provide this statistic to the user with each clustering.

  6. Using this data set, compute the clustering results for K from 2 through 8. Because the starting clusters are randomized, you may want to run the kmeans algorithm several times and take the best result (lowest description length). Record the Description Length for each clustering and include a plot of them in your report. This data set will work best if you do not whiten the data first, but it works ok either way. Include the plots for 3, 4 and 5 in your report.

    What is the best number of clusters, given the Description Length statistics (minimum Description Length)? Do the results make sense? Are there clear clusters in this data? Does your K-means algorithm find the apparent clusters? What are the means of each cluster?

  7. Cluster the Australia Coast data set into 10 clusters using the same dimensions used for the PCA analysis last week and visualize the result. Include a picture of this in your report.
  8. Using a data set of your choice, select a set of features and execute a clustering. Choose your features intentionally after looking at a plot of those features. How many natural clusters appear to exist in the data set when you view it?

    Include a picture of your visualization in your report. Does the result make sense given your initial understanding of the data? Do the clusters tell you anything about your data? What does the Description Length statistic tell you about the appropriate number of clusters?


Extensions


Report

Make a wiki page for the project report.

Handin

Once you have written up your assignment, give the page the label:

cs251s19project7

Put your code your private handin directory on Courses. Please make sure you are organizing your code by project in the Private subdirectory.