CS 251: Lab #6

Lab 6: Clustering Analysis

Project due Monday night Apr 7, 2014

The goal of this week's lab is to implement a k-means clustering function. All code in the lab will be added to analysis.py.


Tasks

  1. In the first step, we will add a function to the analysis module that makes it easy to call scipy's built-in k-means clustering algorithm. We will do this so that you have a working k-means implementation to use as a reference for any future debugging.
    1. In analysis.py, import scipy.cluster.vq as vq (the code below refers to it as vq).
    2. Write a function kmeans_numpy that takes a list of dataColIds and k as input, runs scipy's k-means clustering algorithm on those columns, and returns a tuple containing the mean of each cluster and an N x 1 matrix with the index of the cluster each data point belongs to. If you would like to adapt Stephanie's code, here it is:
      def kmeans_numpy( dataColIds, k ):
          # grab the selected columns as an N x F matrix
          mat = data.get_data_from_list( dataColIds )
          # compute k cluster means; distortion is the average distance to them
          (means,distortion) = vq.kmeans( mat, k )
          # assign each data point to its nearest mean
          (code,dist) = vq.vq( mat, means )
          idxs = np.matrix( code ).T # transform a 1D array to a 2D matrix with 1 column
          return (means,idxs)
              
    3. Test it with clustering_test1.py. You will need to download the iris data set for it to run. Because it uses random initial conditions, there is no single expected output; instead, you should see one of the two outputs shown here.
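      If you would like a quick sanity check outside of the test script, a call might look like the sketch below. This is only a hypothetical example: the column names are placeholders for whatever headers your iris file uses, and it assumes the module's data object has already been loaded.

        # hypothetical interactive check -- the column names are placeholders
        # and the data object is assumed to already hold the iris data
        (means, idxs) = kmeans_numpy( ['sepallength', 'petallength'], 3 )
        print( means )      # one row per cluster mean (3 x 2 here)
        print( idxs[:5] )   # cluster index of the first five data points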
  2. Next, our goal is to replace the call to vq.kmeans with our own code, which implements the algorithm described below:
        initialize the means (function kmeans_init)
        determine which cluster each point belongs to
        while true
            compute the mean of each cluster
            determine which cluster each point belongs to
            if no points have moved, then
                break
        

    We will break this up into these functions:

    • kmeans_init: This function will take the data (an N x F matrix) and k as input and return a matrix of k randomly-generated means (a k x F matrix).
    • kmeans_classify: This function will take the data (an N x F matrix) and the means (a k x F matrix) as input. It will return an N x 1 matrix with the index of the cluster each point belongs to, and an N x 1 matrix with the distance from each point to the mean of its cluster. (For the indices, 0 means the point belongs with the mean in the first row of the means matrix, 1 means it belongs with the mean in the second row, and so on.)
    • kmeans_algorithm: This function will take the data (an N x F matrix) and the initial means (a k x F matrix) as input. It will execute the loop (classifying the points, and recomputing the means) until no points change clusters. It will return the means (a k x F matrix), the cluster indices (an N x 1 matrix), and the distances from each point to its mean (an N x 1 matrix).

    1. Write kmeans_init and test it with clustering_test2.py. The output is random, but here are a few examples.
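      If you are unsure where to start, here is one minimal sketch of kmeans_init. It picks k distinct rows of the data at random to use as the initial means; choosing random data points is only one common initialization strategy, and the use of np.random.choice is an assumption, not a requirement.

      def kmeans_init( mat, k ):
          # mat is an N x F matrix; pick k distinct row indices at random
          # (one common strategy -- any reasonable random means are fine)
          N = mat.shape[0]
          picks = np.random.choice( N, k, replace=False )
          # return the chosen rows as a k x F matrix of initial means
          return np.matrix( mat[picks,:], dtype=float )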
    2. Write kmeans_classify and test it with clustering_test3.py. The expected output is here.
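      If you get stuck, here is one possible sketch of kmeans_classify. It computes the Euclidean distance from every point to every mean and then takes the closest mean for each point; looping over the k means is just one way to organize the computation.

      def kmeans_classify( mat, means ):
          # build an N x k array of Euclidean distances from each point to each mean
          N = mat.shape[0]
          k = means.shape[0]
          d = np.zeros( (N, k) )
          for i in range(k):
              diff = np.asarray( mat - means[i,:] )  # broadcast the i-th mean over all rows
              d[:,i] = np.sqrt( np.sum( diff*diff, axis=1 ) )
          # index of the closest mean for each point (N x 1)
          idxs = np.matrix( np.argmin( d, axis=1 ) ).T
          # distance from each point to the mean of its cluster (N x 1)
          distances = np.matrix( np.min( d, axis=1 ) ).T
          return (idxs, distances)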
    3. Copy kmeans_algorithm:
      def kmeans_algorithm( mat, init_means ):
          MAX_ATTEMPTS = 100
          F = init_means.shape[1]
          k = init_means.shape[0]
          (idxs,distances) = kmeans_classify( mat, init_means )
          means = init_means.copy()
          for attempt in range(MAX_ATTEMPTS):
              # save the current indices
              save_idxs = idxs
              # compute the new "means"
              for i in range(k):
                  (idx0,idx1) = np.nonzero( save_idxs == i )
                  idx0 = np.squeeze( np.asarray( idx0 ))
                  if idx0.size == 0:
                      # Something has gone so wrong that there are no things in this
                      # cluster. So make its mean full of NaN's
                      means[i,:] = np.nan + np.zeros( (1, F) )
                  else:
                      submat = mat[idx0,:]
                      means[i,:] = submat.mean(axis=0)
              # redo the classification
              (idxs,distances) = kmeans_classify( mat, means )
              # did it change?
              diff = np.abs(idxs-save_idxs)
              if diff.sum() == 0:
                  # we are done!
                  break
          return (means,idxs,distances)
          
    4. Write kmeans (modeling it after kmeans_numpy, but replacing the call to vq.kmeans with calls to your kmeans_init and kmeans_algorithm functions).
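      A minimal sketch of kmeans, assuming the same module-level data object that kmeans_numpy uses:

      def kmeans( dataColIds, k ):
          # grab the selected columns, pick initial means, then run our own loop
          mat = data.get_data_from_list( dataColIds )
          init_means = kmeans_init( mat, k )
          (means, idxs, distances) = kmeans_algorithm( mat, init_means )
          # return the same tuple shape as kmeans_numpy
          return (means, idxs)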
    5. Test it with clustering_test4.py. Its output should be the same as that of the first test, with one exception: it is possible for our version to produce empty clusters, in which case the corresponding mean will be NaN.

When you are done with the lab exercises, you may start on the rest of the project.