CS 251: Assignment #6

Clustering

Due Thursday, 8 April 2010

The goal of this week's lab is to build a K-means clustering algorithm and be able to display the clusters in your visualization system using colors.


Tasks

The primary task this week is to implement at least one form of clustering. The default is K-means, but you are free to select any other form of clustering that seems appropriate for your data set. Check with me first if you do something besides K-means.

The secondary task this week is to set up a meeting with your client, if you have not already done so, and give them an update. You should also take the chance to talk with them more about what kinds of questions you might be able to explore with the data. The kinds of questions that will fit best with our machine learning methods are ones that have a predictive form or that look for relationships within the data. Try to actually have the meeting no later than the 15th. Sooner is good.

  1. The K-means algorithm is outlined in the notes and below. It breaks down into an initialization step, a classification step, and a cluster update step.
    Given: N feature vectors and the number of clusters K
    
    # initialization phase
    Initialize K cluster means C by randomly selecting K points and
    assigning their values to the cluster means.
    
    Initialize K cluster means C_old to zeros
    Initialize a vector Csize of size K to zeros
    Initialize a Marker array of length N to -1
    Change = a large value
    
    while Change is bigger than a (fairly small) threshold
        # classify the data points using the current clusters
        for each data point x_i
             compare x_i to each cluster
             set Marker[i] to the id of the closest cluster
    
        # calculate the average cluster means
        C_old = C    
        set C to all zeros
        set Csize to all zeros
        for each data point x_i
           C[ Marker[i] ] += x_i
           Csize[ Marker[i] ] ++
    
        for each cluster c_i
           c_i /= Csize[i]
    
        # calculate the change
        Change = sum over i of the Euclidean distance of C[i] - C_old[i]
    

    When calculating distances in the classification stage, if your feature vectors don't have the same numerical scales (like RGB in an image) you should be using either scaled Euclidean or Mahalanobis distance.

    Your clustering algorithm should return the Marker array, or label array as well as the cluster means. The Marker array indicates cluster memberships.

    To test out your algorithm, you can use the cluster test data, which is a well behaved 2D data set with four natural clusters.

  2. Given a clustering, the user should be able to visualize the clusters in at least one visualization form. Interactive 3D is nice if you have 3D data. Even a 1D histogram or number line plot may be useful.

Extensions


Writeup

For this week's writeup, make one child page from the main data project page. Put a description of the K-means algorithm as you implemented it and some example visualizations.

Handin

Once you have written up your assignment, give the page the label:

cs251s10project6

Put your code in the COMP/CS251 folder on fileserver1/Academics. Please make sure you are organizing your code by project.