CS 251: Assignment #6

### Clustering

Due Thursday, 8 April 2010

The goal of this week's lab is to build a K-means clustering algorithm and to display the resulting clusters in your visualization system using color.

The primary task this week is to implement at least one form of clustering. The default is K-means, but you are free to select any other form of clustering that seems appropriate for your data set. Check with me first if you do something besides K-means.

The secondary task this week is to set up a meeting with your client, if you have not already done so, and give them an update. You should also take the chance to talk with them more about what kinds of questions you might be able to explore with the data. The kinds of questions that will fit best with our machine learning methods are ones that have a predictive form or that look for relationships within the data. Try to actually have the meeting no later than the 15th. Sooner is good.

1. The K-means algorithm is outlined in the notes and below. It breaks down into an initialization step, a classification step, and a cluster update step.
```
Given: N feature vectors and the number of clusters K

# initialization phase
Initialize the K cluster means C by randomly selecting K data points
and assigning their values to the cluster means.

Initialize K old cluster means C_old to zeros
Initialize a vector Csize of size K to zeros
Initialize a Marker array of length N to -1
Change = a large value

while Change is bigger than a (fairly small) threshold

    # classify the data points using the current clusters
    for each data point x_i
        compare x_i to each cluster mean
        set Marker[i] to the id of the closest cluster

    # recalculate the cluster means
    C_old = C
    set C to all zeros
    set Csize to all zeros
    for each data point x_i
        C[ Marker[i] ] += x_i
        Csize[ Marker[i] ] += 1

    for each cluster mean c_i
        if Csize[i] > 0
            c_i /= Csize[i]
        else
            c_i = C_old[i]   # keep the old mean for an empty cluster

    # calculate the change
    Change = sum over i of the Euclidean distance between C[i] and C_old[i]
```
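The pseudocode above can be sketched in NumPy. This is a minimal illustration, not a required implementation; the function name, the threshold default, and the `rng` parameter are my own choices:

```python
import numpy as np

def kmeans(data, K, threshold=1e-4, rng=None):
    """K-means on an (N, d) array of feature vectors.
    Returns (means, markers): the (K, d) cluster means and the
    length-N array of cluster ids."""
    rng = np.random.default_rng(rng)
    N = data.shape[0]
    # initialization: pick K distinct data points as the starting means
    means = data[rng.choice(N, size=K, replace=False)].astype(float)
    change = np.inf
    while change > threshold:
        # classification: distance from every point to every mean,
        # then assign each point to its nearest cluster
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        markers = np.argmin(dists, axis=1)
        # update: recompute each mean; keep the old mean if a cluster empties
        old = means.copy()
        for k in range(K):
            members = data[markers == k]
            if len(members) > 0:
                means[k] = members.mean(axis=0)
        change = float(np.sum(np.linalg.norm(means - old, axis=1)))
    return means, markers
```

The vectorized distance computation replaces the inner "compare x_i to each cluster" loop, but the logic is otherwise the same as the pseudocode.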

When calculating distances in the classification stage, if your features are not all on the same numerical scale (unlike, say, the RGB channels of an image, which share a common range), you should use either scaled Euclidean or Mahalanobis distance.
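One common convention for scaled Euclidean distance is to divide each feature difference by that feature's standard deviation over the data set (Mahalanobis additionally accounts for the full covariance matrix). A small sketch, with the function name and signature my own:

```python
import numpy as np

def scaled_euclidean(x, y, stds):
    """Euclidean distance where each feature difference is divided by
    that feature's standard deviation, so no single large-scale
    feature dominates the comparison."""
    diff = (np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) / np.asarray(stds, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))
```

With `stds` computed once over the whole data set (e.g. `data.std(axis=0)`), this drops into the classification step in place of plain Euclidean distance.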

Your clustering algorithm should return both the Marker (label) array and the cluster means. The Marker array indicates each point's cluster membership.

To test out your algorithm, you can use the cluster test data, which is a well-behaved 2D data set with four natural clusters.

2. Given a clustering, the user should be able to visualize the clusters in at least one visualization form. Interactive 3D is nice if you have 3D data. Even a 1D histogram or number line plot may be useful.
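A simple way to color clusters in any of these forms is to map each cluster id to a palette entry and pass the resulting per-point colors to your plotting code. A minimal sketch (the palette values are arbitrary hex colors, not a requirement):

```python
# A few distinguishable hex colors; swap in whatever your plotting code uses.
PALETTE = ["#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00"]

def cluster_colors(markers, palette=PALETTE):
    """Return one color string per data point, cycling through the
    palette if there are more clusters than palette entries."""
    return [palette[m % len(palette)] for m in markers]
```

For a 2D scatter plot, for example, the returned list can be fed directly to the per-point color argument of your plotting routine.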

### Extensions

• Implement a second clustering method or add options to K-means. One option is to initialize and run the K-means algorithm several times, keeping the result with the lowest representation error.
• Create more than one visualization form.
• If you have high dimensional data, implement PCA and do clustering in the eigenspace.
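The restart option from the first extension can be sketched as follows. Here "representation error" is taken to be the sum of squared distances from each point to its cluster mean (SSE); the helper names and the `kmeans_fn(data, K, rng)` signature are assumptions for illustration:

```python
import numpy as np

def representation_error(data, means, markers):
    """Sum of squared distances from each point to its assigned cluster mean."""
    return float(np.sum((data - means[markers]) ** 2))

def best_of_restarts(data, K, kmeans_fn, trials=5, seed=0):
    """Run kmeans_fn several times with different random seeds and keep
    the clustering with the lowest representation error.
    kmeans_fn(data, K, rng) must return (means, markers)."""
    best = None
    for t in range(trials):
        means, markers = kmeans_fn(data, K, rng=seed + t)
        err = representation_error(data, means, markers)
        if best is None or err < best[0]:
            best = (err, means, markers)
    return best  # (error, means, markers)
```

Since K-means only finds a local minimum, keeping the best of several random initializations is a cheap way to avoid a bad starting configuration.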

### Writeup

For this week's writeup, make one child page from the main data project page. On it, put a description of the K-means algorithm as you implemented it, along with some example visualizations.

### Handin

Once you have written up your assignment, give the page the label:

cs251s10project6