Clustering
Due Thursday, 8 April 2010
The goal of this week's lab is to implement a Kmeans clustering algorithm and to display the resulting clusters in your visualization system using color.
Tasks
The primary task this week is to implement at least one form of clustering. The default is Kmeans, but you are free to select any other form of clustering that seems appropriate for your data set. Check with me first if you do something besides Kmeans.
The secondary task this week is to set up a meeting with your client, if you have not already done so, and give them an update. You should also take the chance to talk with them more about what kinds of questions you might be able to explore with the data. The questions that fit best with our machine learning methods are ones that have a predictive form or that look for relationships within the data. Try to hold the meeting no later than the 15th; sooner is better.

The Kmeans algorithm is outlined in the notes and below. It breaks
down into an initialization step, a classification step, and a cluster
update step.
Given: N feature vectors and the number of clusters K

    # initialization phase
    Initialize K cluster means C by randomly selecting K points and
        assigning their values to the cluster means
    Initialize K cluster means C_old to zeros
    Initialize a vector Csize of size K to zeros
    Initialize a Marker array of length N to 1
    Change = a large value

    while Change is bigger than a (fairly small) threshold

        # classify the data points using the current clusters
        for each data point x_i
            compare x_i to each cluster
            set Marker[i] to the id of the closest cluster

        # calculate the average cluster means
        C_old = C
        set C to all zeros
        set Csize to all zeros
        for each data point x_i
            C[ Marker[i] ] += x_i
            Csize[ Marker[i] ] ++
        for each cluster c_i
            c_i /= Csize[i]

        # calculate the change
        Change = sum over i of the Euclidean distance between C[i] and C_old[i]
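The steps above translate fairly directly into NumPy. The following is a minimal sketch, not a reference solution; the function and parameter names are my own, and it adds one guard (skipping empty clusters) that the pseudocode leaves implicit:

```python
import numpy as np

def kmeans(data, K, threshold=1e-4, rng=None):
    """Kmeans clustering following the pseudocode above.

    data: (N, d) array of feature vectors; K: number of clusters.
    Returns (means, marker): the (K, d) cluster means and the (N,)
    Marker array of cluster ids.
    """
    rng = np.random.default_rng(rng)
    N = data.shape[0]
    # initialization: randomly select K data points as the starting means
    means = data[rng.choice(N, size=K, replace=False)].astype(float)
    marker = np.zeros(N, dtype=int)
    change = np.inf
    while change > threshold:
        # classification: assign each point to its closest cluster mean
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        marker = np.argmin(dists, axis=1)
        # update: recompute each mean as the average of its members
        old_means = means.copy()
        for k in range(K):
            members = data[marker == k]
            if members.size > 0:       # guard against empty clusters
                means[k] = members.mean(axis=0)
        # change: total Euclidean movement of the cluster means
        change = np.linalg.norm(means - old_means, axis=1).sum()
    return means, marker
```

Note that the loop recomputes the Marker array before updating the means, so the labels returned always correspond to the final means.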
When calculating distances in the classification stage, if your features are not all on the same numerical scale (as the RGB channels of an image are), you should use either scaled Euclidean or Mahalanobis distance.
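As a concrete illustration, a scaled Euclidean distance divides each feature difference by that feature's standard deviation before summing, so no single large-scale feature dominates. The helper below is a hypothetical sketch (names are my own):

```python
import numpy as np

def scaled_euclidean(x, y, stds):
    """Euclidean distance with each feature difference divided by
    that feature's standard deviation (computed over the data set)."""
    return np.sqrt(np.sum(((x - y) / stds) ** 2))
```

In practice you would compute the standard deviations once from the whole data set (e.g. `data.std(axis=0)`) and pass them into every distance call during classification.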
Your clustering algorithm should return the Marker (label) array as well as the cluster means. The Marker array indicates each point's cluster membership.
To test out your algorithm, you can use the cluster test data, which is a well-behaved 2D data set with four natural clusters.
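If you want to experiment before grabbing the test file, a stand-in data set with the same character (2D, four well-separated natural clusters) can be generated like this; the actual cluster test data will of course differ:

```python
import numpy as np

rng = np.random.default_rng(42)
# four cluster centers at the corners of a square
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
# 50 points per cluster with a small spread, so the groups are well separated
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
```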
Given a clustering, the user should be able to visualize the clusters in at least one visualization form. Interactive 3D is nice if you have 3D data. Even a 1D histogram or number line plot may be useful.
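One simple visualization form is a 2D scatter plot colored by cluster label. A sketch using matplotlib, assuming 2D points and an integer Marker array (the function name and optional save-to-file parameter are my own):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering; drop this for interactive use
import matplotlib.pyplot as plt

def plot_clusters(points, marker, filename=None):
    """Scatter-plot 2D points, coloring each point by its cluster id."""
    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=marker, cmap="tab10", s=20)
    ax.set_title("Kmeans clusters")
    if filename:
        fig.savefig(filename)
    return fig
```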
Extensions
- Implement a second clustering method or add options to Kmeans. One option is to initialize and run the Kmeans algorithm several times, keeping the result with the lowest representation error.
- Create more than one visualization form.
- If you have high-dimensional data, implement PCA and do clustering in the eigenspace.
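The multiple-restart extension can be sketched as follows. This version uses SciPy's kmeans2 as a stand-in for your own implementation, and measures representation error as the sum of squared distances from each point to its assigned cluster mean (the function name and `runs` parameter are my own):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def best_of_n_runs(data, K, runs=10):
    """Run Kmeans several times from random initializations and keep
    the clustering with the lowest representation error (sum of squared
    distances of each point to its assigned cluster mean)."""
    best = None
    for _ in range(runs):
        # minit='points' selects K random data points as initial means
        means, labels = kmeans2(data, K, minit='points')
        err = np.sum((data - means[labels]) ** 2)
        if best is None or err < best[0]:
            best = (err, means, labels)
    return best[1], best[2]
```

Because Kmeans can converge to a local minimum, keeping the best of several runs usually gives a noticeably better clustering than a single run.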
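For the PCA extension, one common approach is to center the data, take its SVD, and project onto the top principal directions before clustering in that lower-dimensional eigenspace. A minimal sketch, assuming rows are data points (the function name is my own):

```python
import numpy as np

def pca_project(data, n_components):
    """Project data onto its top principal components, i.e. the leading
    eigenvectors of the covariance matrix, via SVD of the centered data."""
    centered = data - data.mean(axis=0)
    # rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T
```

You would then run your Kmeans implementation on the projected coordinates rather than on the original high-dimensional feature vectors.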
Writeup
For this week's writeup, make one child page from the main data project page. Include a description of the Kmeans algorithm as you implemented it, along with some example visualizations.
Handin
Once you have written up your assignment, give the page the label:
cs251s10project6
Put your code in the COMP/CS251 folder on fileserver1/Academics. Please make sure you are organizing your code by project.