CS 251: Lab #8

Lab 8: Decision Tree Analysis

Project due Monday night Apr 21, 2014

The purpose of this lab is to write the preliminary functions to support a 1R decision tree algorithm. In the project, you will fully implement it. As with the Naive Bayes classifier, you will not be required to update your display.


Tasks

  1. Make a folder for project 8. Put the latest and greatest versions of data.py, analysis.py, and display.py in it. You will be adding code to analysis.py
  2. The first function we need is the one that computes the entropy of a set of values.
            # Returns the entropy of a set of values.
            # Vals should be a single-row matrix of non-negative numbers
            # Returns a scalar value.
            def entropy( vals ):
    
  3. Test your function with test_entropy.py. Here is the expected output.
  4. The rest of the project will involve creating a class to represent each node of a decision tree. We will call it a DecisionTreeNode and it will contain a set of fields that let use access the entire data set and a set of fields that refer to just the points that "make it" to this node. It will also have fields that will reference its child nodes. Copy and paste this class declaration and initialization method into analysis.py.
    class DecisionTreeNode:
        def __init__( self, all_data, all_class_vals, my_indices ):
            # the set of data we need to index into
            # (each node in the same tree will have the same value for
            # these fields)
            self.all_data = all_data # N x F matrix
            self.all_class_vals = all_class_vals # N x 1 matrix of ints, with class vals 0, 1, ... Num_Classes-1        
            self.num_class_vals = np.max( all_class_vals ) +1
    
            # information specific to this node
            self.my_indices = my_indices # row indices (list) (which points make it this far in the tree)
            # compute the entropy for this node, based on the number of
            # rows with each class value
            my_num_per_class = np.zeros( (1,self.num_class_vals) )
            for idx in self.my_indices:
                my_num_per_class[0,self.all_class_vals[idx,0]] += 1
            self.my_class = np.argmax( my_num_per_class )  # the class value of the majority of data pts in the node
            self.my_entropy = entropy( my_num_per_class )
    
            # information to be filled in later, if this node becomes a decision node
            self.kids = [] # will be list of DecisionTrees
            self.best_f = None
            self.best_split_thresh = None
    
  5. Test your function with oneR_test1.py. It uses this data file. Here is the expected output.
  6. Write a method compute_outinfo_of_this_split_point that will help us to evaluate the effectiveness of making a decision using a given feature and a given threshold for that feature. The output provides the information we need to construct child nodes. It includes the info_out (the weighted average of the child entropies) and the indices of the points that will go into each child node.
        # uses these fields:
        # self.my_indices
        # self.num_class_vals
        # self.all_class_vals
        # input: feature: M x 1 matrix of the values for this feature, 
        #                 including only those rows that made it to this node.
        #     thresh: the threshold to use to split the data. all rows with a value 
        #             for this feature that is <= threshold will go into group1. 
        #             all rows with a value for this feature that is > threshold 
        #             will go into group2.
        #     min_num_in_group: if either group1 or group2 would have fewer members 
        #            than min_num_in_group, then the function won't compute the outinfo. 
        #            Instead it will return (None,None,None)
        # output: (info_out,g1idx,g2idx)
        #    info_out: the weight averages of the entropies of this nodes children, 
        #        if we choose to use this split point and make those children nodes
        #    g1idx: an NG1x1 array indicating which rows would belong in group1
        #            where NG1 is the number of points in group 1
        #    g2idx: an NG2x1 array indicating which rows would belong in group2
        #            where NG2 is the number of points in group 2
        def compute_outinfo_of_this_split_point( self, feature, thresh, min_num_in_group ):
    

    Notes:

    • You may use Numpy's nonzero function to get a list of indices. For example
      idx1 = np.nonzero( feature <= thresh )[0]
      
      puts a list of indices into the feature of points that have values less than or equal to the threshold. I recommend you familiarize yourself with this function and its output before putting it into your code.
    • To initalize the g1idx and g2idx variables, first determine the number of points in each group, then use np.zeros to construct it. You will want to make sure all the entries are integers, and you can do that easily in the zeros function:
              g1idx = np.zeros( (idx1.size,1), dtype=np.int )
      
  7. Test your function with oneR_test2.py. It uses this data file. Here is the expected output.

When you are done with the lab exercises, you may start on the rest of the project.