Advanced Data Analysis
For this project you'll be analyzing some results from the Colby alumni office. This week we'll be doing some exploratory research, looking for simple rules and patterns in the data. The goal is to help the alumni office better understand the factors that influence giving to the Colby alumni fund.
This week you'll also make use of the Weka package, using the alumni data to explore the basic functionality.
The data set is on the Course Wiki.
This data set needs to be treated with care. The data set should not be made publicly available or put on the web in any way that someone outside the course could access it. Please keep all copies of the data set in a single folder on your working computer, and delete any copies that you create on public computers when you are finished.
Write a Python script that reads in the data and evaluates 1R rules for the amount of giving in 2008. You'll have to divide that attribute into two classes. The alumni office is most interested in being able to identify those who may give over $1000 or $1500. Try creating output categories for both levels of giving and compare the results.
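As a minimal sketch of the idea for a nominal attribute, the code below labels each gift as high or low against a threshold and then builds a 1R rule that maps each attribute value to its majority class. The function names (`label_giving`, `one_r_nominal`), the `class_decade` attribute, and the toy values are all hypothetical and are not taken from the alumni data set. "Over" is read here as strictly greater than the threshold, which is an assumption.

```python
from collections import Counter, defaultdict


def label_giving(amount, threshold):
    """Return 'high' if the gift is strictly over the threshold, else 'low'."""
    return "high" if amount > threshold else "low"


def one_r_nominal(values, labels):
    """Build a 1R rule for a single nominal attribute.

    Returns (rule, errors): rule maps each attribute value to the class
    seen most often with it; errors is the number of training rows the
    rule gets wrong.
    """
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1

    rule = {}
    errors = 0
    for value, class_counts in counts.items():
        majority, n_majority = class_counts.most_common(1)[0]
        rule[value] = majority
        errors += sum(class_counts.values()) - n_majority
    return rule, errors


# Hypothetical toy data, for illustration only.
gifts_2008 = [2000, 50, 1200, 0, 1600, 25]
class_decade = ["1970s", "1990s", "1970s", "1990s", "1980s", "1980s"]

for threshold in (1000, 1500):
    labels = [label_giving(g, threshold) for g in gifts_2008]
    rule, errors = one_r_nominal(class_decade, labels)
    print(threshold, rule, errors)
```

Running the same function over every attribute and keeping the one with the fewest errors gives the 1R rule for that threshold.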
For numeric attributes, your 1R script should scan through possible cutoff points and identify the single decision value that best predicts the output class.
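One possible way to do the cutoff scan is sketched below. It sorts the rows by attribute value, tries the midpoint between each pair of neighbouring distinct values as a cutoff, and counts the errors made when each side of the cutoff predicts its own majority class. The name `best_cutoff`, the `years_giving` attribute, and the toy values are hypothetical; the sketch assumes a two-class output.

```python
def best_cutoff(values, labels, positive="high"):
    """Find the single cutoff on a numeric attribute with the fewest errors.

    Rows with value <= cutoff form one group and the rest form the other;
    each group predicts its majority class. Returns (cutoff, errors).
    cutoff is None when no split beats predicting the overall majority.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(1 for _, label in pairs if label == positive)

    # Baseline: no split, always predict the overall majority class.
    best = (None, min(total_pos, n - total_pos))

    left_pos = 0
    for i in range(n - 1):
        if pairs[i][1] == positive:
            left_pos += 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # equal values cannot be separated by a cutoff
        left_n = i + 1
        right_n = n - left_n
        right_pos = total_pos - left_pos
        errors = (min(left_pos, left_n - left_pos)
                  + min(right_pos, right_n - right_pos))
        if errors < best[1]:
            best = ((pairs[i][0] + pairs[i + 1][0]) / 2, errors)
    return best


# Hypothetical toy data, for illustration only.
years_giving = [1, 2, 3, 7, 8, 9]
labels = ["low", "low", "low", "high", "high", "low"]
print(best_cutoff(years_giving, labels))
```

Keeping a running count of positives on the left side means each candidate cutoff is scored in constant time after the initial sort.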
- Using the same two output categories as you created for step one, use the J48 algorithm to build a decision tree in Weka. Explore at least 3 different combinations of attributes to see what is most effective.
- Try dividing the 2008 giving data into more categories (3 or more) and repeat the previous step.
- Each leaf of the tree constitutes a classification rule. For your best performing decision tree, enumerate the classification rule for each leaf that predicts the highest class of giving. Do these make sense?
- Try using alternative methods--Decision Forests and Multilayer Perceptrons, for example--to predict the levels of 2008 giving directly as numeric values. You may need to sample the data set (cut it down in size) if Weka runs out of memory. Alternatively, if you can figure out how to increase the amount of memory available to the Java Virtual Machine you can try running the entire data set.
- Try building association rules for the data set.
- Evaluate other possible classifiers, e.g. Naive Bayes. For any other classifier you try, make sure you have some understanding of the method.
- Try predicting other aspects of the alumni data, such as the frequency category.
|Frequency category|Code|
|No gift last year, but gave 3 of last 5 FYs|1|
|Aristotle Society Donors|2|
|No gift last year but at least 1 in last 5 FYs|3|
|All other LYBUNT category|4|
|Last gift more than 5 years ago (Alums)|6|
|Returning lapsed Donor|8|
|First gift ever in last fiscal year|9|
|Aristotle Eligible Donors|10|
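Two of the steps listed above, splitting 2008 giving into three or more categories and cutting the data set down when Weka runs out of memory, can be done in Python before loading the data into Weka. The sketch below is one way to do it under stated assumptions: the names `bin_giving` and `sample_rows`, the bin edges, and the category names are hypothetical choices, not values prescribed by the assignment.

```python
import bisect
import random


def bin_giving(amount, edges, names):
    """Map a gift amount to a category name.

    edges must be sorted ascending and names must have len(edges) + 1
    entries. An amount equal to an edge falls into the lower category,
    so with edges [100, 1000] the top category means "over $1000".
    """
    return names[bisect.bisect_left(edges, amount)]


def sample_rows(rows, k, seed=0):
    """Return k rows drawn without replacement (all rows if k is too large).

    A fixed seed makes the sample repeatable between runs.
    """
    rows = list(rows)
    if k >= len(rows):
        return rows
    return random.Random(seed).sample(rows, k)


# Hypothetical toy data, for illustration only.
edges = [100, 1000]
names = ["small", "medium", "large"]
gifts_2008 = [0, 100, 250, 1000, 5000]

print([bin_giving(g, edges, names) for g in gifts_2008])
print(sample_rows(gifts_2008, 3))
```

If a sample is written back to a file for Weka, remember that it is another copy of the data set and falls under the same handling rules as the original.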
The writeup for each weekly project should be a brief summary of what you did along with some screen shots, graphs, or tables of results, depending upon the assignment. Please organize the writeup as follows.
- Title of the project and your name
- An abstract describing what you did in 200 words or fewer.
- A brief description of code you wrote or analysis you undertook for the project. In particular, describe your best performing classifier for a layperson (your audience is the alumni office).
- Figures, screen shots, graphs, tables, or other results.
- A brief description of what you learned.
Make your writeup for the project a wiki page in your personal space. If you have questions about making a page, stop by during office hours 1-3pm on Mondays or Tuesdays.
Once you have written up your assignment, give the page the label:
Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, attach it to an email and send it to the prof. Please do not copy the file into your email, but keep it as a separate attachment.