CS 251: Assignment #7

### Clustering

Due Monday 13 April 2015

The goal of this week's project is to add basic K-means clustering analysis to your program. During the lab you will implement your own version of K-means clustering.

You will implement two versions of K-means clustering. One will use the built-in Numpy functions, one you will write yourself. You can use the Numpy version as a reference while debugging (and as a potentially faster version for very large data sets.

1. Add a function to your analysis module that uses Numpy's built-in k-means capabilities. It should take as arguments a data set, the headers to use for clustering, and the number of clusters to create. It will return the cluster means, the cluster ID of each point, and the representation error.
1. import scipy.cluster.vq as vq
2. Write the following kmeans_numpy function.
```def kmeans_numpy( d, headers, K, whiten = True):
'''Takes in a Data object, a set of headers, and the number of clusters to create
Computes and returns the codebook, codes, and representation error.
'''

# assign to A the result of getting the data from your Data object
# assign to W the result of calling vq.whiten on A

# assign to codebook, bookerror the result of calling vq.kmeans with W and K
# assign to codes, error the result of calling vq.vq with W and the codebook

# return codebook, codes, and error
```
3. Test your code with this test file using this data. You should get these results. Note the clusters/ids may be reversed.
2. Now write your own K-means function. The basic K-means algorithm is as follows.
```  initialize a set of K mean vectors
loop
identify the closest cluster mean for each data point
compute a new set of cluster means
calculate the error between the old and new cluster means
if the error is small enough, break
identify the closest cluster mean for each data point
return the new cluster means and the cluster IDs and distances for each data point
```

To make the above function easier, write two helper functions: kmeans_init and kmeans_classify.

• The kmeans_init should take in the data, the number of clusters K, and an optional set of categories and return a numpy matrix with K rows, each one repesenting a cluster mean. If no categories are given, a simple way to select the means is to randomly choose K data points (rows of the data matrix) to be the first K cluster means.

If you are given an Nx1 matrix of categories, then compute the mean values of each category and return those as the initial set of means. You can assume the categories are zero-indexed and range from 0 to K-1.

• The kmeans_classify should take in the data and cluster means and return a list or matrix (your choice) of ID values and distances. The IDs should be the index of the closests cluster mean to the data point. The distances should be the Euclidean distance to the nearest cluster mean.

You can test your kmeans_classify with this file. You should get this result.

• Copy the following algorithm, then go through it and make sure you understand what it does. This is the main K-means algorithm.
```def kmeans_algorithm(A, means):
# set up some useful constants
MIN_CHANGE = 1e-7
MAX_ITERATIONS = 100
D = means.shape[1]
K = means.shape[0]
N = A.shape[0]

# iterate no more than MAX_ITERATIONS
for i in range(MAX_ITERATIONS):
# calculate the codes
codes, errors = kmeans_classify( A, means )

# calculate the new means
newmeans = np.zeros_like( means )
counts = np.zeros( (K, 1) )
for j in range(N):
newmeans[codes[j,0],:] += A[j,:]
counts[codes[j,0],0] += 1.0

# finish calculating the means, taking into account possible zero counts
for j in range(K):
if counts[j,0] > 0.0:
newmeans[j,:] /= counts[j, 0]
else:
newmeans[j,:] = A[random.randint(0,A.shape[0]),:]

# test if the change is small enough
diff = np.sum(np.square(means - newmeans))
means = newmeans
if diff < MIN_CHANGE:
break

# call classify with the final means
codes, errors = kmeans_classify( A, means )

# return the means, codes, and errors
return (means, codes, errors)
```

Once you have the above functions written, then write the main kmeans function, which should be similar to the numpy version from step 1.

```def kmeans(d, headers, K, whiten=True, categories = ''):
'''Takes in a Data object, a set of headers, and the number of clusters to create
Computes and returns the codebook, codes and representation errors.
If given an Nx1 matrix of categories, it uses the category labels
to calculate the initial cluster means.
'''

# assign to A the result getting the data given the headers
# if whiten is True
# assign to W the result of calling vq.whiten on the data
# else
# assign to W the matrix A

# assign to codebook the result of calling kmeans_init with W, K, and categories

# assign to codebook, codes, errors, the result of calling kmeans_algorithm with W and codebook

# return the codebook, codes, and representation error
```

Test your kmeans function using this file. Your results should be pretty much identical to the numpy version.

Note that if you whiten your data before clustering, that the cluster means will be in the whitened coordinate system. Therefore, if you want to plot those cluster means in the original data space you need to do the following for each dimension, where c_i is the cluster mean in the original data space for dimension i, cw_i is the whitened cluster mean, m_i is the data mean, and s_i is the standard deviation.

c_i = c_w*s_i + m_i

When you are done with the lab, go ahead and start the project.