Project 8: Decision Tree Analysis
Project due Monday night Apr 21, 2014
The purpose of this assignment is to give you the opportunity to code your own 1R decision tree. The design is set up so that you can extend it to include trees with more than 1 decision node.
- Write a method that will find the best split point for a given feature. It should do so by testing multiple split points (e.g. every value of that feature) and choosing the split point with the smallest info out. It returns all the information necessary to make the child nodes representing this decision, along with the value of info out.
```python
# Uses fields:
#   self.all_data
#   self.my_indices
# input:
#   feature_idx: index of the feature to split on
#   min_num_in_group: if either group1 or group2 would have fewer members than
#     min_num_in_group, then the function won't compute the info out for that
#     split point. If no valid split is found, it returns
#     (None, None, None, None, None).
# output: (information_out, feature_split_idx, feature_split_thresh, g1idx, g2idx)
def find_split_point( self, feature_idx, min_num_in_group ):
```
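The "info out" of a candidate split is the size-weighted average of the two groups' entropies. A minimal sketch of that computation (the standalone function names `entropy` and `info_out` are mine, not part of the handout):

```python
import numpy as np

def entropy(class_vals):
    # Shannon entropy (base 2) of a 1-D array of integer class labels
    _, counts = np.unique(class_vals, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_out(class_vals_g1, class_vals_g2):
    # Size-weighted average entropy of the two child groups
    n1, n2 = len(class_vals_g1), len(class_vals_g2)
    return (n1 * entropy(class_vals_g1) + n2 * entropy(class_vals_g2)) / (n1 + n2)
```

A perfect split (each group pure) gives an info out of 0; a split that leaves both groups evenly mixed between two classes gives 1 bit.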
- Test your function with oneR_test3.py. It uses this data file. Here is the expected output.
- Write a method become_decision_node that searches through the features for the one that provides the lowest information out. Make two new child nodes with the output from find_split_point for the best feature.
```python
# This node shouldn't be a leaf. It should be a decision node. So
# have it find the best split point and make two children.
# Note that this is not recursive. It should become so if you are going
# to use this for a tree other than a 1R tree.
def become_decision_node( self ):
```
- To "make a child node", you will need to make a new DecisionTreeNode and store it in the list of the node's children, e.g.

```python
self.kids.append( DecisionTreeNode( self.all_data, self.all_class_vals, best_g1idx ) )
```

Append them to the list in order, so that the first kid has group 1 indices and the second kid has group 2 indices.
- We are going to need to store information in this node about what decision it helps us make (i.e. how did we decide which training data points went into the first child and which ones went into the second child?). So give self.best_split_thresh the threshold value we used and self.best_f the index of the feature we used.
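Putting the steps above together, a minimal, self-contained sketch of the node class might look like the following. The field names (all_data, all_class_vals, my_indices, kids, best_f, best_split_thresh, my_class) follow the handout; the entropy helper, the majority-class leaf prediction, and the tie-breaking details are my assumptions:

```python
import numpy as np

class DecisionTreeNode:

    def __init__( self, all_data, all_class_vals, my_indices ):
        self.all_data = all_data              # NxF data matrix (shared by all nodes)
        self.all_class_vals = all_class_vals  # Nx1 class values (shared by all nodes)
        self.my_indices = np.array( list(my_indices) )
        self.kids = []
        self.best_f = None
        self.best_split_thresh = None
        # a leaf predicts the majority class of its points (assumption)
        vals = list( np.asarray(all_class_vals).flatten()[self.my_indices] )
        self.my_class = max( set(vals), key=vals.count )

    def _entropy( self, idx ):
        # Shannon entropy (base 2) of the class values at the given row indices
        vals = np.asarray( self.all_class_vals ).flatten()[idx]
        _, counts = np.unique( vals, return_counts=True )
        p = counts / counts.sum()
        return -np.sum( p * np.log2(p) )

    def find_split_point( self, feature_idx, min_num_in_group ):
        # Try every value of the feature as a threshold; keep the split
        # with the smallest info out (size-weighted child entropy).
        col = np.asarray( self.all_data )[self.my_indices, feature_idx]
        best, best_info = (None, None, None, None, None), np.inf
        for thresh in np.unique( col ):
            g1 = self.my_indices[col <= thresh]
            g2 = self.my_indices[col > thresh]
            if len(g1) < min_num_in_group or len(g2) < min_num_in_group:
                continue
            info = ( len(g1) * self._entropy(g1) +
                     len(g2) * self._entropy(g2) ) / len(self.my_indices)
            if info < best_info:
                best_info = info
                best = (info, feature_idx, thresh, g1, g2)
        return best

    def become_decision_node( self ):
        # Search every feature for the split with the lowest info out,
        # record the decision, then make two children (group 1 first).
        best = (np.inf, None, None, None, None)
        for f in range( self.all_data.shape[1] ):
            res = self.find_split_point( f, 1 )
            if res[0] is not None and res[0] < best[0]:
                best = res
        if best[1] is None:
            return  # no valid split found; stay a leaf
        _, self.best_f, self.best_split_thresh, g1idx, g2idx = best
        self.kids.append( DecisionTreeNode(self.all_data, self.all_class_vals, g1idx) )
        self.kids.append( DecisionTreeNode(self.all_data, self.all_class_vals, g2idx) )
```

This is a sketch to check your structure against, not a drop-in replacement for the class you have been building in the earlier tasks.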
- Test your function with oneR_test4.py. It uses this data file. Here is the expected output.
- Now we are ready to make a new class OneRuleDecisionTree that will contain a DecisionTreeNode. We can think of it as a wrapper class that uses the original class to do the tree-building. It will have two methods:
- __init__: This builds the tree, given a set of training data.
- classify: This allows us to classify a set of points using the tree we built earlier.
```python
import numpy as np
# assumes DecisionTreeNode is defined above (or imported)

class OneRuleDecisionTree:

    # Build the classifier
    def __init__( self, mat, class_values ):
        # mat is an NxF matrix of the data
        # class_values is an Nx1 matrix of the class values
        # it should contain ints with values between 0 and C-1
        # where C is the number of classes
        N = mat.shape[0]
        self.root = DecisionTreeNode( mat, class_values, range(N) )
        # Then tell that node to make two kids, based on the best
        # split it can find.
        self.root.become_decision_node()

    # method for head node. Classify all of these data points.
    # input: MxF matrix of M data points
    # output: Mx1 matrix of predicted class values for those data points
    def classify( self, data_pts ):

        # follow the path down the decision tree that will classify this point
        # data_pt is a 1 x F matrix
        def classify_pt( node, data_pt ):
            if len(node.kids) == 0:
                return node.my_class
            if data_pt[0, node.best_f] <= node.best_split_thresh:
                return classify_pt( node.kids[0], data_pt )
            else:
                return classify_pt( node.kids[1], data_pt )

        M = data_pts.shape[0]
        pred_class_vals = np.zeros( (M, 1) )
        for i in range(M):
            pred_class_vals[i, 0] = classify_pt( self.root, data_pts[i, :] )
        return pred_class_vals
```
- Test the code with oneR_test5.py. It uses this data file. Here is the expected output. (Yes, I realize this is testing your ability to copy and paste, but I am assuming you would appreciate the test code :) )
- Classify the iris data using both this classifier and your Naive Bayes classifier. Compute the confusion matrices (you can use the training data for both). How do the two classifiers compare?
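For comparing the two classifiers, the confusion matrix can be computed directly from the true and predicted class values. A minimal sketch (the function name is mine; rows are true classes, columns are predicted classes):

```python
import numpy as np

def confusion_matrix(true_vals, pred_vals, num_classes):
    # cm[i, j] counts the points whose true class is i and predicted class is j
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(np.asarray(true_vals).flatten(), np.asarray(pred_vals).flatten()):
        cm[int(t), int(p)] += 1
    return cm
```

The diagonal holds the correctly classified points, so comparing the two classifiers amounts to comparing how much mass each matrix has off the diagonal.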
- Write specialized scripts to test the 1R decision tree classifier on one more data set. The data set must have a class feature (e.g. the name of the iris species). (Note that it was an extension in project 2 to turn enumerated types into numeric types. If you have not completed that extension, then you will need to make sure that the class values are integers between 0 and C-1, where C is the number of classes.)
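If your class labels are still strings, a quick stand-in for the project 2 extension is to map each distinct label to an integer between 0 and C-1 (the function name here is mine):

```python
def labels_to_ints(labels):
    # Map each distinct label to an integer 0..C-1, in sorted label order,
    # and return both the converted list and the lookup dictionary.
    uniq = sorted(set(labels))
    lookup = {name: i for i, name in enumerate(uniq)}
    return [lookup[name] for name in labels], lookup
```

Keep the lookup dictionary around so you can translate predicted integers back to label names when you write out results.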
- Write out your data and classification results to a file, then visualize it using your display.py program.
- Update your GUI to allow 1R decision tree classification (model it after the way you added clustering or Naive Bayes). It would be great to visualize the confusion matrix, too.
- Build the simple tree we talked about in class (with or without pruning).
- Implement cross-validation, if you haven't done so already.
- Analyze more data sets.
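For the cross-validation extension, one common approach is k-fold splitting of the row indices; a sketch, assuming NumPy and a fixed seed for reproducibility (the function name is mine):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # Shuffle the indices 0..n-1 and split them into k (nearly) equal folds.
    # Returns a list of k (train_indices, test_indices) pairs, where each
    # fold takes one turn as the test set.
    rng = np.random.RandomState(seed)
    order = rng.permutation(n)
    folds = np.array_split(order, k)
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        splits.append((train_idx, test_idx))
    return splits
```

Build a fresh classifier on each training split, classify the matching test split, and average the accuracies (or sum the confusion matrices) across the k folds.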
For this week's writeup, create a wiki page that shows your 1R Decision Tree results (images are helpful!), analyzes the results of your classifying from tasks 7 and 8, and explains any extensions.
Once you have written up your assignment, give the page the label:
Put your code in a folder named Proj8 on the Courses/CS251 server in your private subdirectory.