Final Project Design
The final project is to develop novel visualizations and undertake an analysis of a selected data set. You can customize your code, as needed. You may work with a partner for this project. Use the lab time to identify a partner, if any, and to explore potential projects.
Data sets you can consider for this project include the following.
- Anything on the UCI Machine Learning site. Find and read some of the source papers to identify interesting questions.
- A data set of your choice, in consultation with the professor.
- Word Priming data set. This data set looks at the relationship between a word recalled by a subject and a primer word given to them. Specifically, participants were given a one-word cue--a monosyllabic word--and asked to respond as quickly as possible with the first word that came to mind that 'looked or sounded' like the cue. The goal of generating this data set is to obtain free association norms for a large set of items with measures of forward and backward association strength (i.e., what is the probability that a given cue will elicit a given response and vice versa). We are also interested in what characteristics of cues and responses predict association strength or response frequency. Many characteristics of the primer and recalled words are included in the data set. This is an excellent choice for doing prediction with a learned classifier.
- Study Habits data set. This data set was built to examine two questions: whether stable cognitive or personality factors and the use of different study strategies and techniques predict achievement as measured by GPA, and what factors predict the benefits of retrieval practice (i.e., the extent to which taking a test improves retention relative to repeated study). This data set is amenable to both visualization and analysis, particularly predictive analysis that tries to relate study habits (independent variables) to performance (dependent variable).
- Systems Bio data set. This data set tracks 123 cells in the circadian clock over time. It is a good set for examining and visualizing data that change over space and time. One question you might ask is: are cells that are tightly synchronized to each other also close to each other in the tissue?
- Baseball Strike zone data. This data set looks at the relationship between a coach or player berating an umpire about a call and subsequent changes in the strike zone as measured by how the umpire judges strikes and balls. The main task here is to develop an analysis and visualization methodology that shows any relationships between a complaint event and subsequent changes in the strike zone.
- Effects of time zones on baseball win percentages. A study a few years ago showed that winning percentages of teams that had to travel from the Pacific Time Zone to the Eastern Time Zone were lower than for teams traveling the other way. One possible project would be to extend that study to compare winning percentages between all possible pairs of time zones and to examine how long it takes teams to recover (second game after arriving in a new time zone, third game, etc.). This project requires a good bit of data manipulation to get the data in a form that can be analyzed. A useful output of this project might be a GUI that allows these relationships to be visualized and analyzed.
- Effects of batter-handedness and pitcher-handedness. This project incorporates two data sets: one for left-handed pitchers and how they pitch against right-handed and left-handed batters, and a similar one for right-handed pitchers. Although batters typically do better against an opposite-handed pitcher, it is not always the case. This is an interesting data set for mining and modeling, including with clustering, linear regression, and logistic regression. It is also full of possible visualization opportunities.
- MCMI Matched Soccer Data Set. This data set contains around 4000 matched pairs--one boy and one girl--of high school soccer players. The data set includes demographic information, medical history, neurocognitive scores on the ImPACT test, and a 22-item checklist of concussion symptom scores. The primary questions of interest concern similarities and differences between boy and girl soccer players, making use of the matched pairs to reduce the number of confounding variables.
- Bird Arrivals Data Set. Arrival dates of Maine migratory breeding birds. The data set has 21 years of arrival dates of over 100 migratory breeding birds. The biophysical region (there are 15 in the state) could be used to look for patterns of arrival across the state for each species. Comparisons of birds that eat similar food could be interesting: warblers feeding on caterpillars, aerial insectivores like swallows, swifts and flycatchers, nectar-feeders (hummingbirds and orioles), and aquatic fish-eaters (loons, grebes, Belted Kingfisher, Osprey). For most years, it is also possible to get data on temperature departure from normal (indicating a cold or warm spring) and the NAO (North Atlantic Oscillation), a hemispheric phenomenon that affects weather on a broad scale.
- Red-breasted Nuthatch Christmas Count Data. Winter population dynamics of Red-breasted Nuthatches at the level of states and provinces. The data represent state- or province-wide mean abundances (measured as Number of Birds per Party-hour) from 1962 to 2014. Red-breasted Nuthatches stage southern migrations (called irruptions) from their northerly breeding grounds in some years. I am interested in the synchrony and extent of these southern migrations among states and provinces.
If you are using any of the specific data sets listed above, you can get them from the professor.
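As an illustration of the kind of measure mentioned for the Word Priming data set, here is a minimal Python sketch of forward and backward association strength. It assumes the data can be reduced to a list of (cue, response) pairs; the function names and the toy pairs are hypothetical and are not taken from the actual data set, whose format may differ.

```python
from collections import Counter

def forward_strengths(pairs):
    """Estimate P(response | cue) for every observed (cue, response) pair.

    `pairs` is a list of (cue, response) tuples, one per trial.
    This layout is an assumption made for illustration only.
    """
    pair_counts = Counter(pairs)
    cue_counts = Counter(cue for cue, _ in pairs)
    return {
        (cue, resp): count / cue_counts[cue]
        for (cue, resp), count in pair_counts.items()
    }

def backward_strength(forward, cue, response):
    """Strength with which `response`, used as a cue, elicits `cue`.

    Returns 0.0 if that reverse pairing was never observed.
    """
    return forward.get((response, cue), 0.0)

# Toy, made-up trials used only to exercise the functions.
toy_pairs = [("cat", "hat"), ("cat", "hat"), ("cat", "bat"), ("hat", "cat")]
toy_forward = forward_strengths(toy_pairs)
toy_backward = backward_strength(toy_forward, "cat", "hat")
```

With real data, the resulting strengths could serve as the dependent variable when asking which cue and response characteristics predict association strength.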
Once you have selected a data set, pick 1-3 questions that you want to answer using the data set. These questions can be observational, such as identifying relationships between variables, or they can be predictive, such as predicting a dependent variable from a set of independent variables. Consider what kinds of useful visualizations will be helpful in explaining the answers to the questions.
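To make the distinction between observational and predictive questions concrete, the following sketch computes a Pearson correlation (an observational measure of the relationship between two variables) and a least-squares line (a simple predictive model of a dependent variable from one independent variable). It uses only the Python standard library; the function names and the sample values are made up for illustration, and your own analysis code may already provide equivalents.

```python
from math import sqrt

def _sums(xs, ys):
    """Return the means and centered sums used by both functions below."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return mx, my, sxy, sxx, syy

def pearson_r(xs, ys):
    """Observational: strength of the linear relationship between xs and ys."""
    _, _, sxy, sxx, syy = _sums(xs, ys)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Predictive: least-squares slope and intercept for y = slope * x + intercept."""
    mx, my, sxy, sxx, _ = _sums(xs, ys)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up values, chosen so the relationship is exactly linear.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
r = pearson_r(xs, ys)
slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept  # prediction for a new x value
```

A scatter plot of the two variables with the fitted line drawn over it is one visualization that explains both kinds of answer at once.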
Selecting your questions should be done in consultation with the professor or the developer of the data set. For some data sets, the questions will have specific answers you can write programs to generate. In other cases, the questions will be answered by a program that lets the user interact with the data.
After selecting your questions, develop a plan of analysis that outlines the process and methods you will use to generate an answer. In some cases, this will involve simply using your analysis and visualization code. In other cases, it may involve designing code to create custom visualizations or analyses.
Write up your project design as a wiki page with the label cs251s16project9. This should be done by April 25th. Once your design is complete, start executing your plan to complete the final project.