CS 251: Assignment #5

Advanced Data Analysis

For this project you'll be analyzing some results from the Colby alumni office. This week we'll be doing some exploratory research, looking for simple rules and patterns in the data. The goal is to help the alumni office better understand the factors that influence giving to the Colby alumni fund.

This week you'll also make use of the Weka package, using the alumni data to explore the basic functionality.


Tasks

The data set is on the Course Wiki.

This data set needs to be treated with care. The data set should not be made publicly available or put on the web in any way that someone outside the course could access it. Please keep all copies of the data set in a single folder on your working computer, and delete any copies that you create on public computers when you are finished.

  1. Write a Python script that reads in the data and evaluates 1R rules for the amount of giving in 2008. Because 1R predicts a class rather than a number, you'll have to divide the 2008 giving amounts into two categories. The alumni office is most interested in being able to identify those who may give over $1000 or $1500. Try creating output categories for both levels of giving and compare the results.

    For numeric attributes, your 1R script should scan through possible cutoff points and identify the single decision value that best predicts the output class.

  2. Using the same two output categories as you created for step one, use the J48 algorithm to build a decision tree in Weka. Explore at least 3 different combinations of attributes to see what is most effective.
  3. Try dividing the 2008 giving data into more categories (3 or more) and repeat the previous step.
  4. Each leaf of the tree constitutes a classification rule. For your best performing decision tree, enumerate the classification rule for each leaf that predicts the highest class of giving. Do these make sense?
  5. Try using alternative methods--Decision Forests and Multilayer Perceptrons, for example--to predict the levels of 2008 giving directly as numeric values. You may need to sample the data set (cut it down in size) if Weka runs out of memory. Alternatively, if you can figure out how to increase the amount of memory available to the Java Virtual Machine, you can try running the entire data set.
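
As a rough illustration of the cutoff scan described in task 1, here is a minimal Python sketch for a single numeric attribute. The function names (make_labels, best_cutoff), the attribute, and the toy values are all hypothetical; they are not taken from the alumni data set, and your own script will need to read the real file and loop over every attribute. The sketch assumes each side of the cutoff predicts its majority class, and it scores a cutoff by the number of training examples it misclassifies.

```python
def make_labels(amounts, threshold):
    """Turn numeric giving amounts into two classes: True if over threshold."""
    return [amount > threshold for amount in amounts]


def best_cutoff(values, labels):
    """Return (cutoff, errors) for the best single-threshold rule.

    values: numbers for one numeric attribute
    labels: booleans for the output class (True = high giving)
    Returns (None, len(values) + 1) if every value is identical,
    since no split is possible in that case.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_true = sum(1 for _, label in pairs if label)
    best = (None, n + 1)
    true_below = 0  # True labels among the first i + 1 sorted examples

    for i in range(n - 1):
        if pairs[i][1]:
            true_below += 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # cannot place a cutoff between identical values

        # Candidate cutoff: midpoint between two adjacent distinct values.
        cutoff = (pairs[i][0] + pairs[i + 1][0]) / 2.0
        below = i + 1
        above = n - below
        true_above = total_true - true_below

        # Each side predicts its majority class, so the errors on a side
        # are the count of whichever class is in the minority there.
        errors = (min(true_below, below - true_below)
                  + min(true_above, above - true_above))
        if errors < best[1]:
            best = (cutoff, errors)
    return best


# Toy example with made-up numbers (not real alumni data).
attribute = [1, 2, 3, 10, 11, 12]
gifts_2008 = [50, 0, 200, 1500, 2500, 1200]
labels = make_labels(gifts_2008, 1000)
print(best_cutoff(attribute, labels))
```

Running the same scan with a threshold of 1500 instead of 1000 gives the second set of output categories, so the two levels of giving can be compared attribute by attribute.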

Extensions


Writeup

The writeup for each weekly project should be a brief summary of what you did along with some screen shots, graphs, or tables of results, depending upon the assignment. Please organize the writeup as follows.

  1. Title of the project and your name
  2. An abstract describing what you did in 200 words or fewer.
  3. A brief description of code you wrote or analysis you undertook for the project. In particular, describe your best performing classifier for a layperson (your audience is the alumni office).
  4. Figures, screen shots, graphs, tables, or other results.
  5. A brief description of what you learned.

Handin

Make your writeup for the project a wiki page in your personal space. If you have questions about making a page, stop by during office hours 1-3pm on Mondays or Tuesdays.

Once you have written up your assignment, give the page the label:

cs251s09project6

Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, attach it to an email and send it to the prof. Please do not paste the code into the body of the email; keep it as a separate attachment.