Clustering
Due Monday 10 April 2017
The goal of this week's project is to add basic Kmeans clustering analysis to your program. During the lab you will implement your own version of Kmeans clustering.
Tasks
You will implement two versions of Kmeans clustering. One will use the builtin Numpy functions, one you will write yourself. You can use the Numpy version as a reference while debugging (and as a potentially faster version for very large data sets.

Add a function to your analysis module that uses Numpy's builtin
kmeans capabilities. It should take as arguments a data set, the
headers to use for clustering, and the number of clusters to
create. It will return the cluster means, the cluster ID of each
point, and the representation error.
 import scipy.cluster.vq as vq

Write the following kmeans_numpy function.
def kmeans_numpy( d, headers, K, whiten = True): '''Takes in a Data object, a set of headers, and the number of clusters to create Computes and returns the codebook, codes, and representation error. ''' # assign to A the result of getting the data from your Data object # assign to W the result of calling vq.whiten on A # assign to codebook, bookerror the result of calling vq.kmeans with W and K # assign to codes, error the result of calling vq.vq with W and the codebook # return codebook, codes, and error
 Test your code with this test file using this data. You should get these results. Note the clusters/ids may be reversed.

Now write your own Kmeans function. The basic Kmeans algorithm is as follows.
initialize a set of K mean vectors loop identify the closest cluster mean for each data point compute a new set of cluster means calculate the error between the old and new cluster means if the error is small enough, break identify the closest cluster mean for each data point return the new cluster means and the cluster IDs and distances for each data point
To make the above function easier, write two helper functions: kmeans_init and kmeans_classify.

The kmeans_init should take in the data, the number of
clusters K, and an optional set of categories (cluster labels
for each data point) and return a numpy matrix with K rows,
each one repesenting a cluster mean. If no categories are
given, a simple way to select the means is to randomly choose
K data points (rows of the data matrix) to be the first K
cluster means. You want to avoid have two of the initial means
being the same.
If you are given an Nx1 matrix of categories/labels, then compute the mean values of each category and return those as the initial set of means. You can assume the categories given in the data are zeroindexed and range from 0 to K1.

The kmeans_classify should take in the data and cluster means
and return a list or matrix (your choice) of ID values and
distances. The IDs should be the index of the closest cluster
mean to the data point. The default distance metric should be
sumsquared distance to the nearest cluster mean.
You can test your kmeans_classify with this file. You should get this result.

Copy the following algorithm, then go through it and make
sure you understand what it does. This is the core Kmeans
algorithm.
def kmeans_algorithm(A, means): # set up some useful constants MIN_CHANGE = 1e7 MAX_ITERATIONS = 100 D = means.shape[1] K = means.shape[0] N = A.shape[0] # iterate no more than MAX_ITERATIONS for i in range(MAX_ITERATIONS): # calculate the codes codes, errors = kmeans_classify( A, means ) # calculate the new means newmeans = np.zeros_like( means ) counts = np.zeros( (K, 1) ) for j in range(N): newmeans[codes[j,0],:] += A[j,:] counts[codes[j,0],0] += 1.0 # finish calculating the means, taking into account possible zero counts for j in range(K): if counts[j,0] > 0.0: newmeans[j,:] /= counts[j, 0] else: newmeans[j,:] = A[random.randint(0,A.shape[0]1),:] #randint is inclusive # test if the change is small enough diff = np.sum(np.square(means  newmeans)) means = newmeans if diff < MIN_CHANGE: break # call classify with the final means codes, errors = kmeans_classify( A, means ) # return the means, codes, and errors return (means, codes, errors)
Once you have the above functions written, then write the toplevel kmeans function, which should be similar to the numpy version from step 1.
def kmeans(d, headers, K, whiten=True, categories = ''): '''Takes in a Data object, a set of headers, and the number of clusters to create Computes and returns the codebook, codes and representation errors. If given an Nx1 matrix of categories, it uses the category labels to calculate the initial cluster means. ''' # assign to A the result getting the data given the headers # if whiten is True # assign to W the result of calling vq.whiten on the data # else # assign to W the matrix A # assign to codebook the result of calling kmeans_init with W, K, and categories # assign to codebook, codes, errors, the result of calling kmeans_algorithm with W and codebook # return the codebook, codes, and representation error
Test your kmeans function using this file. Your results should be pretty much identical to the numpy version.
Note that if you whiten your data before clustering, then the cluster means will be in the whitened coordinate system. The scipy whiten function divides each column by its standard deviation (according to the documentation it does not subtract off the mean first. Therefore, if you want to plot those cluster means in the original data space you need to do the following for each dimension, where c_i is the cluster mean in the original data space for dimension i, cw_i is the whitened cluster mean, and s_i is the standard deviation.
c_i = c_w*s_i
Note that, before whitening, you may want to subtract the original data mean and store it. In that case, you need to add the data mean to the right side of the expression above to get the clusters in the original data space.

The kmeans_init should take in the data, the number of
clusters K, and an optional set of categories (cluster labels
for each data point) and return a numpy matrix with K rows,
each one repesenting a cluster mean. If no categories are
given, a simple way to select the means is to randomly choose
K data points (rows of the data matrix) to be the first K
cluster means. You want to avoid have two of the initial means
being the same.
When you are done with the lab, go ahead and start the project.