Data Analysis and Visualization
For this project we'll be analyzing some results from a physics experiment. The experiment involves analyzing EM radiation reflected from a surface inside a magnetic field. We're looking for differences in the reflected spectrum when the magnetic field changes. The data is inherently two dimensional--frequency versus amplitude--but there is some added complexity.
- First, each measurement is repeated several times, so we want to work with averages and calculate standard deviations for each measurement.
- Second, we want to compare the spectra under various magnetic field strengths to the baseline spectrum under the background magnetic field. The comparison can be in the form of a difference or a ratio. The difference (or ratio) is actually the quantity of interest for visualization and analysis. In addition to visualizing the differences or ratios, we also want to view them in the context of the estimated noise in the signal.
The data set is here. To unzip/tar it use the terminal command:
tar xfz long-dataset1.tgz
To get started, you probably want to write separate pre-processing programs to generate the data files for visualization. You are free to write these in a language of your choice, but python scripts are fairly nice to use for this data. The appendix at the end of this page tells you how to process command line parameters in a python script.
All of the laser data files have the same format: the first column is frequency and the second is the response. In addition, there is some header information at the beginning of each file in lines starting with a #. The only catch is that there is a header line in the file without a # just before the start of the numbers.
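As a rough sketch of how the format described above could be parsed, the function below skips the # header lines and also drops the single header line that lacks a #. The name read_spectrum is hypothetical, and the code assumes the two columns are separated by whitespace, which the description above does not state.

```python
def read_spectrum(lines):
    """Parse an iterable of text lines into (frequencies, responses).

    Assumption: columns are whitespace-separated. Lines starting with '#'
    and any line whose first two fields are not numbers are skipped.
    """
    freqs, responses = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented header
        fields = line.split()
        try:
            freq, resp = float(fields[0]), float(fields[1])
        except (ValueError, IndexError):
            # The header line without a '#' (or any other non-numeric line)
            continue
        freqs.append(freq)
        responses.append(resp)
    return freqs, responses
```

An open file object can be passed directly, for example read_spectrum(open(filename)), since iterating over a file yields its lines.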
The names of the data files tell you about the data within. The important parts of the name are the A and B characters and the number following. The A character means it is one of the background field measurements, while the B character indicates it is a measurement taken while the field strength is being stepped up. The following number is the magnetic field strength. All files with the same A/B designation whose magnetic field strengths agree in the first digit (after rounding) should be averaged together. Maintain the naming conventions throughout your analysis.
Write a program that takes in a set of laser data files and generates
a data file that has three columns: frequency, average response, and
standard deviation. You can ask the user what to call the output or
compute the filename automatically.
As an example, if you write the data processing as a separate script, you could call your program as below, which would average all the data files that begin with di3_11000_1TrepsA and end with .dat.
python myscript.py di3_11000_1TrepsA*.dat
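The averaging step described above might be sketched as follows. The function name mean_and_std is hypothetical. The sketch assumes that every file in a set was sampled at the same frequencies, so the responses can be combined point by point, and it uses the sample standard deviation (dividing by n - 1), which is an assumption rather than something the assignment specifies.

```python
import math

def mean_and_std(columns):
    """Average repeated measurements point by point.

    columns: list of equal-length response lists, one per repeated file.
    Returns (means, stds), one value per frequency point.
    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    n = len(columns)
    means, stds = [], []
    for values in zip(*columns):  # values at one frequency across all files
        mean = sum(values) / n
        if n > 1:
            var = sum((v - mean) ** 2 for v in values) / (n - 1)
        else:
            var = 0.0  # a single measurement has no spread to estimate
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds
```

The three output columns are then the shared frequency list, means, and stds.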
- Run your script on each set of measurements at the same field strength with the same character designation (A or B). The A set of measurements will be your baseline spectrum against which all other spectra should be compared. You should end up with an A set average, and a B set average for each field strength.
- Write a script that takes in the baseline average file and one of the B set average files, computes the ratio of the B measurements to the A measurements, and writes out a two-column data file with the frequency and the ratio. Use your visualization program to look at this data.
- Write a script that takes in the baseline average file and any number of additional B set average files and generates one data file with the first column being frequency and the rest of the columns being the ratio of the B values to the A values for each file. For an A set and three B sets, the final data file would have four columns.
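One possible shape for the ratio computation in the tasks above is sketched here. The name ratio_table is hypothetical, and the sketch assumes the averaged responses have already been read into lists that share one frequency grid. Writing NaN where the baseline is zero is a design choice of this sketch, not a requirement of the assignment.

```python
def ratio_table(freqs, baseline, b_sets):
    """Build rows of [frequency, B1/A, B2/A, ...].

    freqs:    list of frequencies
    baseline: averaged A responses, same length as freqs
    b_sets:   list of averaged B response lists, each the same length
    """
    rows = []
    for i, freq in enumerate(freqs):
        row = [freq]
        for b in b_sets:
            if baseline[i] != 0:
                row.append(b[i] / baseline[i])
            else:
                row.append(float("nan"))  # avoid dividing by zero
        rows.append(row)
    return rows
```

With a single B set this gives the two-column file; with three B sets it gives the four-column file.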
Write a script that takes in the baseline average file and any number
of the B set average files, computes the difference of each B
measurement with the A measurement, and then computes the t-statistic
for the difference. Wikipedia has a reasonable page for this. Use
the t-statistic for equal variance, equal sample size.
Write out a multi-column data file with the frequency and the t-statistic for each B measurement (if you have the A set and 3 B sets, you'll have 4 columns in all). Use your visualization program to look at this data. For 5 samples in each measurement set (8 degrees of freedom), a t-statistic above 1.86 indicates that the B value exceeds the A value at the 95% confidence level for a one-sided test; the corresponding two-sided threshold is 2.31. Since you can plot frequency versus t-statistic, you can visually see which ones are significant. Try coloring these plots by the t-statistic, in addition to using it as the y-coordinate.
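For reference, the equal variance, equal sample size form of the t-statistic can be sketched as below. The function name t_statistic is hypothetical. It assumes the standard deviations are sample standard deviations and that both sets contain the same number of measurements n, giving 2n - 2 degrees of freedom.

```python
import math

def t_statistic(mean_b, std_b, mean_a, std_a, n):
    """Two-sample t-statistic for equal sample sizes and equal variance.

    t = (mean_b - mean_a) / (s_p * sqrt(2 / n)),
    where s_p = sqrt((std_a**2 + std_b**2) / 2) is the pooled deviation.
    """
    pooled = math.sqrt((std_a ** 2 + std_b ** 2) / 2.0)
    denom = pooled * math.sqrt(2.0 / n)
    if denom == 0:
        return float("nan")  # no spread in either set; t is undefined
    return (mean_b - mean_a) / denom
```

Applying this at each frequency point, with the A set as the baseline, yields one t-statistic column per B set.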
- Modify your visualization program so you can select one column as the X axis and multiple columns for the Y axis. Plot each column using a different color.
- Integrate some or all of the data processing into your application. Be very careful how you go about this and put together a good design before you start coding. Since we are using a small number of data sets this semester, you may just want to create a menu for each data set type that has all of the various options for that type of data.
- Prior suggested extensions for your visualization system that you haven't yet implemented.
The writeup for each weekly project should be a brief summary of what you did along with some screen shots, graphs, or tables of results, depending upon the assignment. Please organize the writeup as follows.
- Title of the project and your name
- An abstract describing what you did in 200 words or less.
- A brief description of code you wrote or analysis you undertook for the project.
- Figures, screen shots, graphs, tables, or other results.
- A brief description of what you learned.
Make your writeup for the project a wiki page in your personal space. If you have questions about making a page, stop by during office hours 1-3pm on Mondays or Tuesdays.
Once you have written up your assignment, give the page the label:
Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, attach it to an email and send it to the prof. Please do not copy the file into your email, but keep it as a separate attachment.
Appendix: Command Line Arguments
Using command line arguments in python is easy. The standard package sys provides access to all of the command line arguments, including the name of the python file that was executed.
If you put import sys at the top of your file, then you can access the variable sys.argv, which holds a list of all the command line arguments to python. The first element of the list is the name of the file called; the second and later elements are any additional arguments. The following is an example that prints out all of the command line arguments.
import sys

def main():
    # sys.argv[0] is the script name; the rest are the arguments
    for argument in sys.argv:
        print(argument)

if __name__ == "__main__":
    main()
If you put the above code in a file test.py, and then run test.py using:
python test.py a b c d
then it should print out:
test.py
a
b
c
d
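For the data processing scripts, the arguments after the script name are usually the interesting ones. The sketch below, with the hypothetical helper input_files, shows one way to pick them out. On a typical Unix shell a wildcard such as *.dat is expanded before python starts, so the script receives the matching filenames as separate arguments; that behavior belongs to the shell and may differ elsewhere.

```python
import sys

def input_files(argv):
    """Return every argument after the script name (argv[0])."""
    return argv[1:]

if __name__ == "__main__":
    # Print each filename that was passed on the command line
    for name in input_files(sys.argv):
        print(name)
```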