CS 251: Assignment #3

Data Manipulation

Due midnight, 25 February 2010

The goal of this week's lab is to get you working with numpy, matrices, and your data. The following week we'll integrate the data with the GUI and implement the visualization pipeline.


The result of your work this week should be a DataSet class that holds the data as a numpy matrix along with a set of state variables and methods that enable simple data manipulation and transformations.

  1. Create a new Python file DataSet.py (you're free to choose your own names for all of this; these are just suggestions). Build the DataSet class in this file. The most important field in the DataSet class will be a numpy matrix that holds the data (at least the numerical fields). If you have some non-numerical data, you will need to develop a method for storing and managing it. If possible, encode non-numerical data as integer values. You can build a dictionary to do the translation.

    In addition to a default empty DataSet, make an optional argument for the DataSet class constructor that takes in a list of lists and puts the data into the numpy matrix. Then create a method that nicely prints out the data to the command line. Test your methods with some simple examples.

    The NumPy documentation includes a nice PDF book.

    Come up with a plan for handling metadata. One option is to pass it to the constructor as an additional optional argument: a list of pairs, where each pair holds two strings, the name of the column and its metadata. Come up with something that works with your data set. Don't worry too much about making it a general solution to the problem (that's really hard).

  2. Add methods for reading and writing the data to a file. The input format should be whatever format or formats your client uses to store the data. The output format should be a CSV file compatible with Excel. As you design your I/O methods, think about how to manage metadata such as column labels. This information should be stored in the DataSet class. Make sure you test your output file with Excel to ensure it does what you expect.
  3. Create useful methods such as the following.
    • value - takes in a row and column and returns the data value.
    • point - takes in a row index and returns the data vector.
    • columns or dimensions - returns the number of variables in each data point.
    • size - returns the number of data points.
    • range - returns a list of 2-element lists with the minimum and maximum values in each column. With an optional index, returns the range for a single column.
    • mean - returns a list of the mean value for each column. With an optional index, returns the mean for just that column.
    • stdev - returns a list of the standard deviation for each column. With an optional index, returns the stdev for just that column.
  4. Create a method select that returns a numpy matrix with just the selected columns. The function should take in a list of indices and return the matrix. You'll use this function when the user selects which axes to use for plotting. You probably want to limit the number of possible indices to five.
  5. Based on the needs of your client/data set, develop 1-2 more useful methods for your DataSet class. These may be accessor functions or data transformations. In some cases, for example, you may want the largest or smallest value over a set of columns (e.g. the earliest migration arrival date for a particular species over the last decade).
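
To make steps 1, 3, and 4 concrete, here is a minimal sketch of what such a class might look like. It is only one possible design, not a required one. It assumes the data is purely numeric, stores it as a 2-D numpy array rather than a numpy matrix object, takes column names through a hypothetical headers argument, and uses the sample standard deviation (ddof=1). Change any of those choices to suit your data.

```python
import numpy as np


class DataSet:
    """Illustrative sketch only; names and design choices are assumptions."""

    def __init__(self, rows=None, headers=None):
        # rows: optional list of lists of numbers (one inner list per data point).
        # headers: optional list of column names (a hypothetical metadata field).
        if rows is not None and len(rows) > 0:
            self.data = np.array(rows, dtype=float)
        else:
            self.data = np.empty((0, 0))  # default empty DataSet
        self.headers = list(headers) if headers else []

    def __str__(self):
        # One line of column names (if any), then one line per data point.
        lines = []
        if self.headers:
            lines.append("  ".join(f"{h:>10}" for h in self.headers))
        for row in self.data:
            lines.append("  ".join(f"{v:10.3f}" for v in row))
        return "\n".join(lines)

    def value(self, row, col):
        return float(self.data[row, col])

    def point(self, row):
        return self.data[row, :].copy()

    def columns(self):
        return self.data.shape[1]

    def size(self):
        return self.data.shape[0]

    def range(self, col=None):
        # List of [min, max] pairs, or a single pair when col is given.
        lo = self.data.min(axis=0)
        hi = self.data.max(axis=0)
        pairs = [[float(a), float(b)] for a, b in zip(lo, hi)]
        return pairs if col is None else pairs[col]

    def mean(self, col=None):
        means = self.data.mean(axis=0).tolist()
        return means if col is None else means[col]

    def stdev(self, col=None):
        # ddof=1 is the sample standard deviation (an assumption of this sketch).
        sds = self.data.std(axis=0, ddof=1).tolist()
        return sds if col is None else sds[col]

    def select(self, indices, max_columns=5):
        # Returns a new array holding only the requested columns.
        if len(indices) > max_columns:
            raise ValueError("too many columns selected")
        return self.data[:, list(indices)]


# Simple test with made-up numbers.
ds = DataSet([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0]], headers=["a", "b", "c"])
print(ds)
print(ds.size(), ds.columns(), ds.range(0), ds.mean(), ds.select([0, 2]).shape)
```

Note that the statistics methods in this sketch raise an error or warn on an empty DataSet; you may want to decide what your own class should do in that case.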

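For the non-numerical data mentioned in step 1, the dictionary translation could be sketched as follows. The function name encode_column and the example strings are made up for illustration; the idea is just to assign each distinct string an integer in order of first appearance and keep the dictionary so you can translate back.

```python
def encode_column(values):
    """Return (encoded, codes): integer codes for values plus the lookup dictionary."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)  # next unused integer
        encoded.append(codes[v])
    return encoded, codes


# Made-up example column of strings.
species = ["robin", "wren", "robin", "finch"]
encoded, codes = encode_column(species)

# Invert the dictionary to translate integers back to the original strings.
decode = {code: name for name, code in codes.items()}
print(encoded, codes, decode[2])
```

The encoded list can then go into the numpy matrix alongside the numerical columns, with the dictionary stored in the DataSet as metadata.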

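For the CSV output in step 2, one possible approach is Python's standard csv module, whose default dialect is the Excel one. The sketch below assumes purely numeric data with the column labels written as a single header row; write_csv and read_csv are hypothetical names, and your client's input format will likely need its own reader. You should still open the result in Excel to check it.

```python
import csv
import os
import tempfile

import numpy as np


def write_csv(path, headers, data):
    """Write column labels as the first row, then one row per data point."""
    # newline="" is recommended by the csv module docs for files given to csv.writer.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)  # default dialect is "excel"
        writer.writerow(headers)
        writer.writerows(np.asarray(data).tolist())


def read_csv(path):
    """Read a file written by write_csv; returns (headers, 2-D float array)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)
        rows = [[float(x) for x in row] for row in reader if row]
    return headers, np.array(rows, dtype=float)


# Round trip through a temporary file, using made-up values.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
write_csv(path, ["year", "arrival_day"], np.array([[2008.0, 97.0], [2009.0, 94.5]]))
headers, data = read_csv(path)
print(headers)
print(data)
```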

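As an illustration of the kind of method step 5 has in mind, here is a small sketch of finding the smallest value over a set of columns. The function names and the numbers are made up; in your own class these would be methods operating on the stored matrix.

```python
import numpy as np


def min_over_columns(data, indices):
    """Smallest single value found anywhere in the chosen columns."""
    return float(data[:, list(indices)].min())


def row_min_over_columns(data, indices):
    """Smallest value in the chosen columns, computed separately for each row."""
    return data[:, list(indices)].min(axis=1)


# Made-up example: one row per species, one column per year.
arrivals = np.array([[97.0, 94.0, 101.0],
                     [110.0, 108.0, 105.0]])
overall = min_over_columns(arrivals, [0, 1])
per_row = row_min_over_columns(arrivals, [0, 1, 2])
print(overall, per_row)
```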
For this week's writeup, make one child page from the main data project page. On it, describe your DataSet class API, with brief descriptions of all the functions, their inputs, outputs, and purpose. If you do anything with the GUI, put a brief description on a separate wiki page (put the project3 label on both).


Once you have written up your assignment, give the page the label:


Put your code in the COMP/CS251 folder on fileserver1/Academics. Please make sure you are organizing your code by project.