Final Project Design

Due in lab 22/23 April 2013

The goal of this week's lab is to select a data set and develop a plan of analysis for the final project. You may work with one partner on this lab and the final project. If you work with a partner, please submit a single plan that includes both of your names and how you intend to divide the work.


  1. Select a data set to analyze for the final project. You may select one of the sets below or select one of your own.

    • Concussion data set. This data set consists of concussion reporting data from Maine high schools. There are a number of observational questions for this data, including looking at the prevalence of concussions across different sports, ages, weights, genders, and other factors.
    • Word Priming data set. This data set looks at the relationship between a word recalled by a subject and a primer word given to them. Specifically, participants were given a one-word cue (a monosyllabic word) and asked to respond as quickly as possible with the first word that came to mind that 'looked or sounded' like the cue. The goal of generating this data set is to obtain free association norms for a large set of items with measures of forward and backward association strength (i.e., the probability that a given cue will elicit a given response, and vice versa). We are also interested in which characteristics of cues and responses predict association strength or response frequency. Many characteristics of the primer and recalled words are included in the data set. This is an excellent choice for doing prediction with a learned classifier.
    • Systems Bio data set. This data set tracks 123 cells in the circadian clock over time. It is a good set for examining and visualizing data that change over space and time. One question you might ask is: are cells that are tightly synchronized to each other also close to each other in the tissue?
    • Baseball Strike Zone data set. This data set looks at the relationship between a coach or player berating an umpire about a call and subsequent changes in the strike zone, as measured by how the umpire judges strikes and balls. The main task here is to develop an analysis and visualization methodology that reveals any relationship between a complaint event and subsequent changes in the strike zone.
  2. Once you have selected a data set, pick 1-3 questions that you want to answer using the data set. These questions can be observational, such as identifying relationships between variables, or they can be predictive, such as predicting a dependent variable from a set of independent variables.

    Select your questions in consultation with the professor or the developer of the data set.

    For some data sets, the question(s) will have specific answers you can write programs to generate. For other data sets, the question(s) will be answered by programs that allow the user to interact with the data.

  3. After selecting your questions, develop a plan of analysis that outlines the process and methods you will use to generate an answer. In some cases this will involve simply using your analysis and visualization code. In other cases it may involve designing code to create custom visualizations or analyses.
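
As an illustration of what one step in a plan of analysis might look like, here is a minimal sketch for the Word Priming data set that estimates forward and backward association strength from raw trials. It assumes each trial can be reduced to a (cue, response) pair; the function names and the toy trials are hypothetical and are not taken from the actual data file. The sketch also assumes that a reversed pair that was never observed has a backward strength of 0.0, which may not be appropriate if the response word was never presented as a cue.

```python
from collections import Counter

def forward_strengths(trials):
    """Estimate P(response | cue) for every observed (cue, response) pair.

    trials: list of (cue, response) tuples, one per trial (assumed format).
    """
    pair_counts = Counter(trials)
    cue_counts = Counter(cue for cue, _ in trials)
    return {
        (cue, response): count / cue_counts[cue]
        for (cue, response), count in pair_counts.items()
    }

def backward_strength(forward, cue, response):
    """Backward strength of (cue, response): how often `response`,
    when presented as a cue, elicits `cue`.

    Assumption: pairs never observed in reverse are given 0.0.
    """
    return forward.get((response, cue), 0.0)

# Hypothetical toy trials, for illustration only.
trials = [
    ("cat", "hat"), ("cat", "hat"), ("cat", "bat"), ("cat", "cot"),
    ("hat", "cat"), ("hat", "bat"),
]

forward = forward_strengths(trials)
print("forward  cat -> hat:", forward[("cat", "hat")])
print("backward cat -> hat:", backward_strength(forward, "cat", "hat"))
```

The same table of strengths could then serve as the dependent variable when asking which cue and response characteristics predict association strength.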


No extensions this week.


For this week's writeup, create a wiki page that describes the data set you chose and the plan of analysis you intend to execute.


