CS 251: Lab #2

Data Management

The purpose of this week's lab is to create a Data class that allows you to read and write CSV files. Then you should be able to query the Data object for information.


Data Format

To make reading the data straightforward, we're going to use a general format that simplifies the task. The data should have the following properties.

Data Class

The Data class should have methods that read in a CSV data file as described above and that access the data (e.g. retrieve all the data for a particular column). The data will be stored in two forms.

  1. Raw form: In the raw form, each line is a list of strings (i.e. the result of reading in the line and calling split(',')). We keep the raw data in this format so that we have an accurate view of the file and so that we can retrieve values of non-numeric columns. By default, the CSV reader returns each value as a string, so this is the raw form of the data.
  2. Matrix form: Most of the analysis and display will use only the numeric columns of the data. We store those in a numpy matrix. Because a numpy matrix can store data of only one type, this should be a matrix of floating point numbers.
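To make the two forms concrete, here is a minimal sketch of how the same data might look in each. The sample rows and the type labels "numeric" and "string" are made up for illustration, and a 2-D numpy array stands in for the matrix here; your own code may label types and build the matrix differently.

```python
import numpy as np

# Raw form: every row is a list of strings, exactly as read from the file.
# These rows are hypothetical sample data, not taken from a lab file.
raw_data = [
    ["1", "2.5", "apple"],
    ["3", "4.0", "pear"],
]

# ASSUMPTION: one type label per column; the label names are illustrative.
raw_types = ["numeric", "numeric", "string"]

# Matrix form: keep only the numeric columns and convert each value to float.
numeric_cols = [i for i, t in enumerate(raw_types) if t == "numeric"]
matrix = np.array(
    [[float(row[i]) for i in numeric_cols] for row in raw_data],
    dtype=float,
)
```

The string column stays available in raw_data, while matrix holds only floating point values.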

Tasks

  1. Create a Python file data.py and start writing the code for your Data class. The constructor for the Data class should have the option of taking in a filename and then reading the data from the file. The data file should be in the format described above. You may also want your constructor to have the option to take in a list of lists that represents a data set. The following is a possible def line for the Data constructor.

        def __init__(self, filename=None):

    You will need to initialize a number of different fields for the Data class, but you can add them as you need them. For now, think of your constructor as having the following sections.

        # create and initialize fields for the class

        # if filename is not None
        #     call self.read(filename)
    		
  2. Create a read method for reading the data from a file. The method should put the original data in string format into a list of lists, with one sublist for each data point. This is the raw form.

    In addition, the method should store the headers and data types read from the data file. (Note: You might want to write just part of the read method and then test it by writing the accessor methods that get at what you have so far.)

    You will likely want to use this set of fields to manage the raw data:

    • raw_headers (list of all headers)
    • raw_types (list of all types)
    • raw_data (list of lists of all data. Each row is a list of strings)
    • header2raw (dictionary mapping header string to index of column in raw data)

    Note: Once you get to the project, you will need to add fields and code to the read method in order to create and store the numeric data.

    Note: You may test your Data class using testdata1.csv and testdata2.csv.

  3. Write at least these helpful accessor methods. Note that to extract specific columns from a Data object, you will use the column's header (as opposed to an index).
    • get_raw_headers: returns a list of all of the headers.
    • get_raw_types: returns a list of all of the types.
    • get_raw_num_columns: returns the number of columns in the raw data set.
    • get_raw_num_rows: returns the number of rows in the data set. This should be identical to the number of rows in the numeric data, so you can get away with writing just one function for this purpose.
    • get_raw_row: returns a row of data (the type is list) given a row index (int).
    • get_raw_value: takes a row index (an int) and a column header (a string) and returns the raw data at that location. (The return type will be a string.)
  4. You may test your methods with lab2_test1.py if you would like to. Read through the test file and make sure the printed results make sense.
  5. Create a method that nicely prints out the data to the command line.
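
The tasks above can be sketched as one small class. This is only one possible layout, not the required solution. It assumes that the first line of the file holds the headers, that the second line holds the types, and that every row has one value per header; check those assumptions against the data format for this lab. The field and accessor names are the ones suggested above, and using __str__ for the printing method is just one choice.

```python
import csv


class Data:
    """Minimal sketch of the raw-data half of the Data class."""

    def __init__(self, filename=None):
        # create and initialize fields for the class
        self.raw_headers = []   # list of all headers
        self.raw_types = []     # list of all types
        self.raw_data = []      # list of rows; each row is a list of strings
        self.header2raw = {}    # header string -> column index in raw data

        # if a filename was given, read the file right away
        if filename is not None:
            self.read(filename)

    def read(self, filename):
        # ASSUMPTION: line 1 is the headers, line 2 is the types, and every
        # later line is one data point with one value per header.
        with open(filename, newline="") as f:
            rows = [[value.strip() for value in row]
                    for row in csv.reader(f) if row]
        if len(rows) < 2:
            raise ValueError("expected a header line and a type line")
        self.raw_headers = rows[0]
        self.raw_types = rows[1]
        self.raw_data = rows[2:]
        self.header2raw = {h: i for i, h in enumerate(self.raw_headers)}

    def get_raw_headers(self):
        return self.raw_headers

    def get_raw_types(self):
        return self.raw_types

    def get_raw_num_columns(self):
        return len(self.raw_headers)

    def get_raw_num_rows(self):
        return len(self.raw_data)

    def get_raw_row(self, row_index):
        return self.raw_data[row_index]

    def get_raw_value(self, row_index, header):
        # look up the column by header, as opposed to by index
        return self.raw_data[row_index][self.header2raw[header]]

    def __str__(self):
        # pad each column to the width of its longest entry
        rows = [self.raw_headers, self.raw_types] + self.raw_data
        widths = [max(len(row[i]) for row in rows)
                  for i in range(len(self.raw_headers))]
        return "\n".join(
            "  ".join(v.ljust(w) for v, w in zip(row, widths)).rstrip()
            for row in rows
        )
```

With this sketch, print(Data('testdata1.csv')) would print the headers, the types, and each row in aligned columns, provided the file matches the assumed layout.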

When you are done with the lab exercises, you may start on the rest of the project.