Lab 2: Managing Data
Project due Monday night Feb 24, 2014
The purpose of this week's lab is to develop a Data class that can read in a properly formatted csv data file.
In order to make reading the data straightforward, we're going to use a general format for the data that simplifies the task. In general, the data should have the following properties.
- The data should be in CSV format with commas separating different entries.
- The first row of the CSV data file should be the variable names. There must be a name for each column.
- The second row of the data should be the variable types: numeric, string, enum, or date. Numeric types can be either integers or floating point values; strings are arbitrary strings; enum implies there are a finite number of values, which can be strings or numbers; a date should be interpreted as a calendar date.
- Missing numeric data should be specified by the number -9999 in integer format. The same number written with a decimal point (e.g. -9999.0) would be treated as an actual value.
- Any row that begins with a hash symbol should be ignored by the reader.
- You are free to use the CSV module, which is standard in Python 2.7. The 2.7 documentation is here.
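To make the format concrete, here is a minimal sketch of a file that follows the rules above, read with the csv module. The sample contents, the date format, and the helper names read_rows and is_missing are made up for illustration; they are not part of the lab's provided code.

```python
import csv

# Hypothetical sample file contents in the format described above.
SAMPLE = """name,age,joined,color
string,numeric,date,enum
# rows that begin with a hash symbol are skipped
alice,31,2014-02-01,red
bob,-9999,2014-02-03,blue
"""

def read_rows(lines):
    """Return the non-comment rows as lists of stripped strings.

    `lines` can be an open file or any iterable of text lines.
    """
    rows = []
    for row in csv.reader(lines):
        # Skip blank lines and rows whose first entry starts with '#'.
        if not row or row[0].strip().startswith('#'):
            continue
        rows.append([entry.strip() for entry in row])
    return rows

def is_missing(raw_value):
    """True only for the integer-formatted marker -9999 (not -9999.0)."""
    return raw_value.strip() == '-9999'

rows = read_rows(SAMPLE.splitlines())
headers, types, data = rows[0], rows[1], rows[2:]
print(headers)
print(types)
print([is_missing(row[1]) for row in data])
```

Here bob's age is the missing-data marker, so only the second data row is flagged.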
The Data class should have methods that tell it to read in a csv data file as described above and to access data (e.g. retrieve all the data for a particular column). The data will be stored in two forms.
- Raw form: In the raw form, each line is a list of strings (i.e. the result of reading in the line and calling split(',')). We keep the raw data in this format so that we have an accurate view of the file and so that we can retrieve values of non-numeric columns.
- Matrix form: Most of the analysis and display will use only the numeric columns of the data. We store those in a numpy matrix.
In lab, we will write the methods associated with the raw data. They will all have the word raw in them. In the project, we will write the methods associated with the numeric data.
- Create a python file data.py and write the code for the Data class. The constructor for the Data class should have the option of taking in a filename and then reading the data from the file. The data file should be in the format described above. You may also want your constructor to be able to take in a list of lists that represents a data set, but this is optional.
Create a method for reading the data from a file. The method should put the original data in string format into a list of lists, with one sublist for each data point. In addition, the method should store the headers and types read from the data file. (Note: You might want to write just part of the read method and then test it by writing the accessor methods that get at what you have so far.)
You will likely want to use this set of fields to manage the raw data:
- raw_headers (list of all headers)
- raw_types (list of all types)
- raw_data (list of lists of all data. Each row is a list of strings)
- header2raw (dictionary mapping header string to index of column in raw data)
Note: Once you get to the project, you will need to add fields and code to the read method in order to handle the numeric data.
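One possible shape for the constructor and read method, using the fields listed above, is sketched below. This is only a starting point under a few assumptions: entries are stripped of surrounding whitespace, blank lines are skipped, and a file with fewer than two usable rows is treated as an error. You may organize your own class differently.

```python
import csv

class Data:
    def __init__(self, filename=None):
        # Fields for the raw (string) form of the data.
        self.raw_headers = []   # list of all headers
        self.raw_types = []     # list of all types
        self.raw_data = []      # list of rows; each row is a list of strings
        self.header2raw = {}    # header string -> column index in raw_data
        if filename is not None:
            self.read(filename)

    def read(self, filename):
        """Read a CSV file in the lab's format into the raw fields."""
        rows = []
        with open(filename) as fp:
            for row in csv.reader(fp):
                # Ignore blank lines and rows that begin with a hash symbol.
                if not row or row[0].strip().startswith('#'):
                    continue
                rows.append([entry.strip() for entry in row])
        if len(rows) < 2:
            raise ValueError('expected a header row and a type row')
        self.raw_headers = rows[0]
        self.raw_types = rows[1]
        self.raw_data = rows[2:]
        self.header2raw = {}
        for index, header in enumerate(self.raw_headers):
            self.header2raw[header] = index
```

Keeping header2raw alongside the headers means a column lookup by name is a single dictionary access rather than a search through the header list.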
- Write at least these helpful accessor methods. Note that to extract specific columns from a Data object, you will use the column's header (as opposed to an index).
- get_raw_headers: returns a list of all of the headers.
- get_raw_types: returns a list of all of the types.
- get_raw_num_columns: returns the number of columns in the raw data set
- get_raw_num_rows: returns the number of rows in the data set
- get_raw_row: returns a row of data (the type is list) (Note: since there will be the same number of rows in the raw and numeric data, Stephanie is writing just one method and isn't adding the name raw to this one. You can do something different if you want.)
- get_raw_value: takes a row index (an int) and column header (a string) and returns the raw data at that location. (The return type will be a string)
- You may test your methods with lab2_test1.py if you would like to.
- Create a method that nicely prints out the data to the command line.
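The accessor and printing steps above can be sketched as follows. To keep the example self-contained, this stand-in version of Data is built from a list of lists (header row, type row, then data rows) instead of a file. The method name raw_table, its max_rows parameter, and the two-space column layout are illustrative choices, not requirements of the lab, and the sketch assumes every row has one entry per header.

```python
class Data:
    def __init__(self, rows):
        # rows[0] holds the headers, rows[1] the types, the rest is data.
        self.raw_headers = list(rows[0])
        self.raw_types = list(rows[1])
        self.raw_data = [list(row) for row in rows[2:]]
        self.header2raw = {}
        for index, header in enumerate(self.raw_headers):
            self.header2raw[header] = index

    def get_raw_headers(self):
        return self.raw_headers

    def get_raw_types(self):
        return self.raw_types

    def get_raw_num_columns(self):
        return len(self.raw_headers)

    def get_raw_num_rows(self):
        return len(self.raw_data)

    def get_raw_row(self, row_index):
        return self.raw_data[row_index]

    def get_raw_value(self, row_index, header):
        # Columns are looked up by header string, not by index.
        return self.raw_data[row_index][self.header2raw[header]]

    def raw_table(self, max_rows=10):
        """Return headers, types, and the first max_rows rows as aligned text."""
        lines = [self.raw_headers, self.raw_types] + self.raw_data[:max_rows]
        # Each column is as wide as its longest entry.
        widths = [max(len(line[i]) for line in lines)
                  for i in range(self.get_raw_num_columns())]
        return '\n'.join(
            '  '.join(entry.ljust(w) for entry, w in zip(line, widths)).rstrip()
            for line in lines)

d = Data([['name', 'age'],
          ['string', 'numeric'],
          ['alice', '31'],
          ['bob', '-9999']])
print(d.raw_table())
print(d.get_raw_value(1, 'age'))
```

Returning the table as a string and printing it separately makes the formatting easy to check in a test script; a print method can simply call print on the result.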
When you are done with the lab exercises, you may start on the rest of the project.