Final Project Design
The final project is to develop novel visualizations and undertake an analysis of a selected data set. You can customize your code, as needed. You may work with a partner for this project. Use the lab time to identify a partner, if any, and to explore potential projects.
Data sets you can consider for this project include the following.
- Anything in the UCI Machine Learning Repository. Find and read some of the source papers to identify interesting questions.
- A data set of your choice, in consultation with the professor.
- Word Priming data set. This data set looks at the relationship between a word recalled by a subject and a primer word given to them. Specifically, participants were given a one-word cue (a monosyllabic word) and asked to respond as quickly as possible with the first word that came to mind that 'looked or sounded' like the cue. The goal of generating this data set is to obtain free association norms for a large set of items with measures of forward and backward association strength (i.e., the probability that a given cue will elicit a given response, and vice versa). We are also interested in which characteristics of cues and responses predict association strength or response frequency. Many characteristics of the primer and recalled words are included in the data set. This is an excellent choice for doing prediction with a learned classifier.
- Systems Bio data set. This data set tracks 123 cells in the circadian clock over time. It is a good set for examining and visualizing data that change over space and time. One question you might ask is: are cells that are tightly synchronized to each other also close to each other in the tissue?
- Baseball Strike zone data. This data set looks at the relationship between a coach or player berating an umpire about a call and subsequent changes in the strike zone, as measured by how the umpire judges strikes and balls. The main goal here is to develop an analysis and visualization methodology that reveals any relationship between a complaint event and subsequent changes in the strike zone.
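For the Word Priming data set, the forward and backward association strengths described above can be estimated from raw trial counts. The following is a minimal sketch, not the data set authors' method: it assumes each trial can be reduced to a (cue, response) pair, and the function name `association_strengths` and the toy trials are hypothetical. The actual file layout may differ.

```python
from collections import Counter


def association_strengths(pairs):
    """Estimate forward and backward association strength.

    pairs: iterable of (cue, response) tuples, one per trial (assumed format).
    Returns {(cue, response): (forward, backward)} where
      forward  = P(response | cue)
      backward = P(cue | response presented as a cue), or None if the
                 response word was never used as a cue.
    """
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    cue_counts = Counter(cue for cue, _ in pairs)

    strengths = {}
    for (cue, response), n in pair_counts.items():
        forward = n / cue_counts[cue]
        # Counter returns 0 for a word that never appeared as a cue.
        reverse_total = cue_counts[response]
        if reverse_total:
            backward = pair_counts.get((response, cue), 0) / reverse_total
        else:
            backward = None
        strengths[(cue, response)] = (forward, backward)
    return strengths


# Toy trials, invented purely for illustration.
trials = [
    ("cat", "hat"), ("cat", "hat"), ("cat", "bat"),
    ("hat", "cat"), ("hat", "cat"), ("hat", "cat"), ("hat", "mat"),
]
strengths = association_strengths(trials)
```

Returning `None` for the backward strength, rather than 0, keeps "never tested as a cue" distinct from "tested but never elicited the word," which matters if these values later become features for a classifier.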
If you are using the word priming, sys-bio, or baseball data sets, you can get them from the professor.
Once you have selected a data set, pick 1-3 questions that you want to answer using the data set. These questions can be observational, such as identifying relationships between variables, or they can be predictive, such as predicting a dependent variable from a set of independent variables. Consider what kinds of useful visualizations will be helpful in explaining the answers to the questions.
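As an illustration of an observational question, the Systems Bio question above (are tightly synchronized cells also close to each other in the tissue?) could be approached by correlating pairwise distance with pairwise synchrony. The sketch below makes several assumptions that are not from the data set itself: each cell has a 2D position and an equal-length time series, synchrony is measured as the Pearson correlation between two cells' time series, and the names `pearson` and `synchrony_vs_distance` are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def synchrony_vs_distance(positions, traces):
    """Correlate spatial distance with synchrony over all cell pairs.

    positions: list of (x, y) coordinates, one per cell (assumed format).
    traces:    list of time series, one per cell, all the same length.
    A negative result suggests that nearby cells are more synchronized.
    """
    distances, synchronies = [], []
    for i, j in combinations(range(len(positions)), 2):
        distances.append(math.dist(positions[i], positions[j]))
        synchronies.append(pearson(traces[i], traces[j]))
    return pearson(distances, synchronies)


# Three toy cells, invented for illustration: two neighbors in phase,
# one distant cell a quarter cycle out of phase.
toy_positions = [(0, 0), (1, 0), (10, 0)]
toy_traces = [[0, 1, 0, -1], [0, 2, 0, -2], [-1, 0, 1, 0]]
r = synchrony_vs_distance(toy_positions, toy_traces)
```

A single coefficient like this is only a starting point. A scatter plot of distance against synchrony for every pair would show whether the relationship is linear, thresholded, or absent.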
Select your questions in consultation with the professor or the developer of the data set. For some data sets, the questions will have specific answers you can write programs to generate. In other cases, the questions will be answered by a program that lets the user interact with the data.
After selecting your questions, develop a plan of analysis that outlines the process and methods you will use to generate an answer. In some cases, this will involve simply using your analysis and visualization code. In other cases, it may involve designing code to create custom visualizations or analyses.
Write up your project design as a wiki page with the label cs251s15project9. The design is due by April 27th. Once your design is complete, start executing your plan to complete the final project.