CS 251: Assignment #3

Data Manipulation

Due midnight, 24 February 2010

The goal of this week's lab is to get you working with numpy, matrices, and your data. The following week we'll integrate the data with the GUI and implement the visualization pipeline.


Tasks

The main result of your work this week should be a DataSet class that holds the data as a numpy matrix along with a set of state variables and methods that enable simple data manipulation and transformations.

  1. Create a new Python file DataSet.py (you're free to choose your own names for all of this; I'm just giving suggestions). Build the DataSet class in this file. You probably want to use a NumPy matrix to hold the numeric fields in the data. Note that there may be both non-numeric fields and enumerated fields.

    Assume that, for each variable (column in the csv file), you have three pieces of information, two of which are required to be non-null.

    1. A header or title for the column of data.
    2. A type for the column of data (numeric, enum, date, string)
    3. A description (meta-data) for the column of data (a string), which may be the empty string.

    The constructor for the DataSet class should be able to take in a filename as an optional parameter and read in data from that file. Otherwise, you can design your own interface to the constructor as seems appropriate.
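
One possible starting point for the class is sketched below. It is only an illustration, not a required design: the attribute names, the add_column_info helper, and the read stub are hypothetical, and it assumes the two required pieces of column information are the header and the type.

```python
class DataSet:
    """Sketch of a container for tabular data plus per-column metadata."""

    # Column types named in the assignment.
    VALID_TYPES = ("numeric", "enum", "date", "string")

    def __init__(self, filename=None):
        self.headers = []       # assumed required: one title per column
        self.types = []         # assumed required: one type per column
        self.descriptions = []  # optional: may hold empty strings
        self.rows = []          # raw data rows, one list per data point
        if filename is not None:
            self.read(filename)  # hypothetical method, left as a stub here

    def add_column_info(self, header, col_type, description=""):
        """Register metadata for one column (hypothetical helper)."""
        if not header:
            raise ValueError("column header is required")
        if col_type not in self.VALID_TYPES:
            raise ValueError("unknown column type: %r" % (col_type,))
        self.headers.append(header)
        self.types.append(col_type)
        self.descriptions.append(description)

    def read(self, filename):
        # Placeholder only; file reading is part of the next task.
        raise NotImplementedError("file reading is not sketched here")
```

Keeping the metadata in parallel lists means that column i of the data always lines up with headers[i], types[i], and descriptions[i].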

  2. Add methods for reading and writing the data to a file. Assume the data is in CSV format. In addition, the data should have two header rows. The first row should be the headers for each column. The second row should be the data type for each column: number, string. Re-work your data from project 1 into the new format and save it as a CSV file. You are free to use the csv Python package.

    The output format should be a CSV file compatible with Excel. As you design your I/O programs, think about how to manage metadata such as column labels. This information should be stored in the DataSet class. Make sure you test your output file with Excel to ensure it does what you expect and you can go back and forth between your program and Excel with no change in the data.
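
As a rough illustration of the two-header-row layout, the sketch below reads and writes a small table with the csv package. The function names read_table and write_table and the sample bird data are made up for this example, and the data values are deliberately kept as strings.

```python
import csv
import io


def read_table(stream):
    """Read a CSV stream whose first two rows are headers and types."""
    reader = csv.reader(stream)
    headers = next(reader)  # row 1: column headers
    types = next(reader)    # row 2: column data types
    rows = [row for row in reader if row]  # skip blank lines
    return headers, types, rows


def write_table(stream, headers, types, rows):
    """Write the same two-header-row layout back out."""
    writer = csv.writer(stream, lineterminator="\r\n")
    writer.writerow(headers)
    writer.writerow(types)
    writer.writerows(rows)


# Made-up sample file contents, held in memory instead of on disk.
sample = "name,weight\r\nstring,number\r\nrobin,77.0\r\nwren,10.5\r\n"
headers, types, rows = read_table(io.StringIO(sample, newline=""))

out = io.StringIO(newline="")
write_table(out, headers, types, rows)
round_trip_ok = out.getvalue() == sample
```

The round trip here is exact only because the values are never converted. Once numeric columns are turned into floats, writing them back can change their formatting, which is worth checking when you test against Excel. With real files, open them with newline="" as the csv documentation recommends.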

  3. Create a method that nicely prints out the data to the command line. Test your methods with the simple examples below and with your data from project 1, once you have converted it to a CSV file and inserted the necessary meta-data.
  4. Create useful methods such as the following.
    • value - takes in a row and column and returns the data value.
    • point - takes in a row index and returns the data vector.
    • columns or dimensions - returns the number of variables in each data point.
    • size - returns the number of data points.
    • range - returns a list of 2-element lists with the minimum and maximum values in each column. With an optional index, returns the range for a single column. This function needs to work only for numeric data. You are free to provide ranges for other data as well using an appropriate comparison metric.
    • mean - returns a list of the mean value for each column. With an optional index, it returns the mean for just that column. This function needs to work only for numeric data.
    • stdev - returns a list of the standard deviation for each column. With an optional index, it returns the stdev for just that column. This function needs to work only for numeric data.
  5. Create a method select that returns a numpy matrix with just the selected columns. The function should take in a list of indices and return the matrix. You'll use this function when the user selects which axes to use for plotting. You probably want to limit the number of possible indices to between one and five. This function needs to work only for numeric data.
  6. Separate from your DataSet class file, create a main program that takes three arguments from the command line: a filename, the x-axis index, and the y-axis index. When executed, the program should generate a plot with matplotlib (pylab), using the two specified variables as the X and Y axes.
  7. Download the ArrivalsClean.csv data set from the Academics server. Make sure your program can read and store the data properly, as it includes all four types of data. Write a main program that takes in a bird name as the command line argument and generates a histogram (using matplotlib) of all of the arrival dates for all years for the selected bird.
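
The range, mean, stdev, and select methods from tasks 4 and 5 could be built on numpy's column-wise operations. The sketch below uses standalone functions on a plain numpy array rather than methods on a numpy matrix, which is an assumption made to keep the example short; the function names and the sample data are made up.

```python
import numpy as np

# Made-up numeric data: 4 data points (rows) by 3 variables (columns).
data = np.array([[1.0, 10.0, 5.0],
                 [2.0, 20.0, 5.0],
                 [3.0, 30.0, 5.0],
                 [4.0, 40.0, 5.0]])


def column_range(data, index=None):
    """Return [min, max] per column, or for one column if index is given."""
    pairs = [[float(lo), float(hi)]
             for lo, hi in zip(data.min(axis=0), data.max(axis=0))]
    return pairs if index is None else pairs[index]


def column_mean(data, index=None):
    """Return the mean of every column, or of one column if index is given."""
    means = data.mean(axis=0).tolist()
    return means if index is None else means[index]


def column_stdev(data, index=None):
    """Return the standard deviation per column, or for one column."""
    # ddof=1 gives the sample standard deviation; numpy's default (ddof=0)
    # is the population version. Which one to use is a design choice.
    stdevs = data.std(axis=0, ddof=1).tolist()
    return stdevs if index is None else stdevs[index]


def select(data, indices):
    """Return a new array holding only the requested columns, in order."""
    if not 1 <= len(indices) <= 5:
        raise ValueError("select expects between one and five column indices")
    return data[:, indices]
```

For example, select(data, [0, 2]) returns a 4 x 2 array holding the first and third columns, which is the shape you would hand to a two-axis plot.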

Extensions


Writeup

Make a wiki page for the project writeup. On it, describe your DataSet class API, with brief descriptions of all the functions, their inputs, outputs, and purpose.

Describe in your writeup how you store the data internally in your DataSet class, noting how you deal with each different type of data.

Include in your writeup a screen capture of the figures from the last two tasks.

Handin

Once you have written up your assignment, give the page the label:

cs251s11project3

Put your code in the COMP/CS251 folder on fileserver1/Academics. Please make sure you are organizing your code by project.