Project 2: Extracting information
The purpose of this project is to take the next step in working with data to try extracting some kind of information from patterns in the data. You will also make more use of functions and parameters to encapsulate code and build up a useful library of routines.
- If you haven't already set yourself up for working
on the project, then do so now.
- Mount your directory on the Personal server.
- Open the Terminal and navigate to your Project2 directory on the Personal server.
- Open TextWrangler. If you want to look at any of the files you have already created, then open those files.
Download the following CSV file, which contains weather data from the
Maine Lakes Resource Center on five minute intervals since May. If
you open the file, you'll note that it has a lot of header lines
giving details about the data itself.
This data contains 14 fields. For now, we're going to concern ourselves with field 8, PAR. PAR stands for Photosynthetically Active Radiation. Basically, it's a measure of how much visible sunlight is shining on the sensor.
The goal of step one in this project is to figure out how many cloudy days and how many sunny days occurred in June and July of 2015. We're going to do this by looking at the PAR value at noon on each day of June and July and seeing how many of those PAR measurements were above 800uE and how many were below.
The first sub-task is to create a grep command that gives all of the lines
of data from June and July. Looking at the time and date field, it has the form
Note that the month, day, and hour fields can each have either one or two digits. We want to find all of the date patterns that have a 6 or 7 in the month field, any one or two digits in the day field, and that end in /15. Develop the appropriate grep command and test it on the data to make sure you are getting every line from June and July 2015.
The form of your command will look like this:
grep <juneJulyExpression> MaineLakesResourceCenterData16.csv
where <juneJulyExpression> will be replaced by the actual expression (and don't forget to put it in quotes).
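Before committing to a grep expression, it can help to prototype the pattern in Python. The sketch below assumes the date looks like month/day/two-digit year (for example 6/3/15). That layout, the sample strings, and the names JUNE_JULY and is_june_or_july are illustrative assumptions, not taken from the data file. grep's syntax differs slightly from Python's re module, so treat this as a way to check the logic rather than as the final command.

```python
import re

# Hypothetical pattern: month 6 or 7, a one- or two-digit day, year 15.
# \b keeps the month digit from matching the tail of a longer number.
JUNE_JULY = re.compile(r"\b[67]/[0-9]{1,2}/15\b")

def is_june_or_july(line):
    """Return True if the line contains a June or July 2015 date."""
    return JUNE_JULY.search(line) is not None

# Made-up sample strings, only for checking the pattern.
for sample in ["6/3/15 12:00", "7/21/15 9:05", "5/17/15 12:00", "8/6/15 0:15"]:
    print(sample, is_june_or_july(sample))
```

The last two samples are worth including in any test: they contain a 6 or 7 in the day position, which a pattern that is too loose would match by mistake.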
The second sub-task is to extract the PAR value for just those lines that
occur at noon. This is broken into two sub-sub-tasks:
- Find all of the lines containing the string 12:00. Pipe the output of the
grep command from step one to a grep command that extracts all of the
noon lines. Your command should have the form:
grep <juneJulyExpression> MaineLakesResourceCenterData16.csv | grep <noonExpression>
where <noonExpression> will be replaced by the expression to extract only those lines with 12:00 in them (don't forget your quotes). Test this step.
- Pipe the output to a cut command that extracts field 8
delimited by commas. Test that this result gives you just the
PAR column. It should give you 61 PAR values. Pipe the
output of the cut command to wc if you want to get a
line/word/character count. Your terminal command for doing
that should look something like this:
grep <juneJulyExpression> MaineLakesResourceCenterData16.csv | grep <noonExpression> | cut <arguments> | wc
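If it is unclear what the grep and cut stages each contribute, the same two steps can be sketched in Python. This is only an illustration of the idea: the function name noon_field and the three-field sample rows are made up, and the real file has 14 fields. As with cut, the field number counts from 1.

```python
def noon_field(lines, field_number, marker="12:00", delimiter=","):
    """Yield one 1-indexed field from every line that contains marker."""
    for line in lines:
        if marker in line:  # the same job as the second grep
            fields = line.rstrip("\n").split(delimiter)
            yield fields[field_number - 1]  # the same job as cut

# Made-up three-field rows, not the real 14-field layout.
rows = [
    "a,6/3/15 11:55,410.0\n",
    "b,6/3/15 12:00,950.5\n",
    "c,6/4/15 12:00,310.2\n",
]
print(list(noon_field(rows, 3)))
```

Counting the values that come out plays the same role as piping to wc: one value per day is the sign that the noon filter is working.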
The third sub-task is to write a Python program that reads the stream of
PAR data and classifies each value as either a cloudy day or a sunny
day. Use 800uE as the threshold for sunny/cloudy. A value less than
800 means the day is pretty overcast. Your program should also sum
all of the PAR values for sunny days and sum the PAR values for cloudy
days and then compute and print the average for each, along with the
total number of each type of day. There should be 42 sunny days and
19 cloudy days at noon in June and July 2015.
import sys

def main(stdin):
    # assign to nSun the value 0
    # assign to sumSun the value 0
    # assign to nCloud the value 0
    # assign to sumCloud the value 0

    # assign to buf the result of calling readline using the stdin parameter
    while buf.strip() != '':
        # assign to par the result of casting buf to a float
        # if the par value is greater than 800
            # increment nSun by 1
            # increment sumSun by the value of par
        # else
            # increment nCloud by 1
            # increment sumCloud by the value of par

        # assign to buf the result of calling readline using the stdin parameter

    # print out the number of sunny days and the average par of sunny days
    # print out the number of cloudy days and the average par of cloudy days

if __name__ == "__main__":
    main(sys.stdin)
Take a screenshot of the Terminal with the output. You should include this in your wiki page write-up as an image to demonstrate the correctness of your code. This is required image 1.
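One detail the classification step leaves open is what happens when a group is empty, since the average would then divide by zero. The sketch below is one possible way to handle that, not the required solution. The function name summarize_par and the sample readings are made up for illustration, and it works on a list of floats rather than on stdin.

```python
def summarize_par(values, threshold=800.0):
    """Count sunny and cloudy readings and average each group."""
    n_sun = n_cloud = 0
    sum_sun = sum_cloud = 0.0
    for par in values:
        if par > threshold:
            n_sun += 1
            sum_sun += par
        else:
            n_cloud += 1
            sum_cloud += par
    # Guard against dividing by zero when a group is empty.
    avg_sun = sum_sun / n_sun if n_sun else 0.0
    avg_cloud = sum_cloud / n_cloud if n_cloud else 0.0
    return n_sun, avg_sun, n_cloud, avg_cloud

# Made-up readings for illustration only.
print(summarize_par([1000.0, 900.0, 400.0]))
```

For the June and July data both groups are non-empty, so the guard only matters if you later reuse the code on a shorter date range.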
The second main task is to extract the percentage difference between
the temperature at 1m and the temperature at 7m from the live buoy
data (that is the data you need to curl) for June and July of 2015.
We can break this down into three sub-tasks. The first is to identify all the data lines from June and July, similar to the first sub-task above. The second is to cut out the date/time, 1m, and 7m fields. The third is to read in each line, extract the 2nd and 3rd fields, convert them to floats and calculate the percent difference.
- Write a grep command that identifies all of the live buoy data values from June and July. You can test it by piping the output to less.
- Pipe the grep command to a cut command that extracts the date/time (field 1), 1m temperature (field 11), and 7m temperature (field 18) fields. Test that you are getting a result with three fields and that the fields have the appropriate values (hint: you can open the .csv file in TextWrangler or Excel to determine which values you should expect).
- Write a Python program (name it mixing.py) that reads lines from stdin. Use the split
function to divide a line by commas. Then cast the second and
third words to floats using the float function. These are the 1m and 7m temperatures.
Compute the percent change C = (1m - 7m)/7m, then print out the
date/time and percent change. Here are comments to guide you as you write the code:
import sys

def main(stdin):
    # assign to buf the result of calling readline using the stdin parameter
    while buf.strip() != '':
        # assign to words the result of calling buf.split(',')
        # assign to temp1m the result of casting words[1] to a float
        # assign to temp7m the result of casting words[2] to a float
        # assign to change the expression (temp1m - temp7m)/temp7m
        # print out words[0] and change, separated by a comma

        # assign to buf the result of calling readline using the stdin parameter

if __name__ == "__main__":
    main(sys.stdin)
When you run the program, redirect the output to the file
mixing.csv. You can do this using the > symbol.
<other commands> | python mixing.py > mixing.csv
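To check the arithmetic before running on the full data, the per-line computation can be tried on a single hand-made line. The sketch below assumes the three-field layout produced by the cut step (date/time, 1m temperature, 7m temperature). The function name percent_change and the sample values are made up for illustration.

```python
def percent_change(line):
    """Return (date/time, (1m - 7m) / 7m) for one comma-separated line."""
    words = line.strip().split(",")
    temp1m = float(words[1])
    temp7m = float(words[2])
    return words[0], (temp1m - temp7m) / temp7m

# Made-up line: date/time, 1m temperature, 7m temperature.
stamp, change = percent_change("6/1/15 12:00,20.0,16.0\n")
print(stamp + "," + str(change))
```

Note that the result is a fraction rather than a percentage: a 1m temperature of 20.0 against a 7m temperature of 16.0 gives 0.25. That matches the later comparison against 0.05 for a 5% drop.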
Our goal for this project is to examine the surface conditions--wind
speed, precipitation, and sunshine--when the difference in
temperatures between the 1m and 7m measurements drops. The wind speed
and precipitation are available as 5min measurements in the
Maine Lakes Center file (that is the .csv file you downloaded and used in task 2), while the buoy data (the data you curl) is available at 15min
measurements. For this task, we're going to modify the wind speed,
PAR, and precipitation to find the 15min averages from the Maine Lakes Center file.
There are three steps to this process. The first is to extract all of the June and July data, the second is to cut out the four fields we want (date/time, wind speed, PAR, and precipitation), and the third is to pass it to a python program that averages three readings to generate and print out a single reading every 15min.
- Generate a grep command that extracts all of the lines from June and July. This will be the same as the line you used in task 2. Test it again.
- Generate a cut command that extracts the date/time (field 2), Wind Speed (field 4), PAR (field 8), and Rain (field 13). Pipe the output of grep from the previous step to the cut command and test the result.
- Write a Python program (name it energy.py) that reads through the data, extracts the
values from each line, and remembers the last three wind, rain,
and PAR values. If the line is on a 15 min interval, then print
out the average of the last three wind, rain, and PAR values.
import sys

def main(stdin):
    # assign to wind0 the value 0.0
    # assign to wind1 the value 0.0
    # assign to wind2 the value 0.0
    # assign to rain0 the value 0.0
    # assign to rain1 the value 0.0
    # assign to rain2 the value 0.0
    # assign to par0 the value 0.0
    # assign to par1 the value 0.0
    # assign to par2 the value 0.0
    # assign to datetime the empty string ''

    # assign to buf the result of calling readline using the stdin parameter
    while buf.strip() != '':
        # assign to wind2 the value in wind1
        # assign to wind1 the value in wind0
        # assign to rain2 the value in rain1
        # assign to rain1 the value in rain0
        # assign to par2 the value in par1
        # assign to par1 the value in par0

        # assign to words the result of calling split on the buf variable with a comma as argument
        # assign to datetime the value in words[0]
        # assign to wind0 the result of casting words[1] to a float
        # assign to par0 the result of casting words[2] to a float
        # assign to rain0 the result of casting words[3] to a float

        # if any of the strings ":00:", ":15:", ":30:", or ":45:" are in the datetime string
            # assign to avgwind the average of wind0, wind1, and wind2
            # assign to avgpar the average of par0, par1, and par2
            # assign to avgrain the average of rain0, rain1, and rain2
            # print the datetime, average wind, average PAR, and average rain, separated by commas

        # assign to buf the result of calling readline using the stdin parameter

if __name__ == "__main__":
    main(sys.stdin)
Remember to add the appropriate comments to the top of the file (including how to run the program). Test the program and make sure it is printing out values only on 15min intervals. Redirect the output to a file called energy.csv.
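The shifting of old values and the 15-minute check are easier to see with a single measurement. The sketch below tracks only wind and returns a list instead of printing. It assumes each line is date/time followed by wind speed, and that times include seconds (as the ":15:" style markers suggest). The names MARKS and wind_averages and the sample rows are made up for illustration.

```python
MARKS = (":00:", ":15:", ":30:", ":45:")

def wind_averages(lines):
    """Return (datetime, 3-sample mean wind) for rows on a 15-minute mark."""
    wind0 = wind1 = wind2 = 0.0
    results = []
    for line in lines:
        # Shift the older readings back before storing the new one.
        wind2 = wind1
        wind1 = wind0
        words = line.strip().split(",")
        stamp = words[0]
        wind0 = float(words[1])
        if any(mark in stamp for mark in MARKS):
            results.append((stamp, (wind0 + wind1 + wind2) / 3.0))
    return results

# Made-up rows at five-minute spacing.
rows = [
    "6/1/15 0:05:00,3.0",
    "6/1/15 0:10:00,6.0",
    "6/1/15 0:15:00,9.0",
]
print(wind_averages(rows))
```

Because the history starts at 0.0, an average printed before three real readings have arrived still includes those placeholder zeros. Keep that in mind when checking the first line of energy.csv.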
After the last two tasks, you should have a file called mixing.csv and
energy.csv. They should have the same number of lines
(use wc to check this).
Use the Unix tool paste to combine the two files together, using a comma as the delimiter. Redirect the output to the file blend.csv. Check that the two date/time fields line up properly in the output file.
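If you want to see what paste is doing, the line-by-line join can be sketched in Python. The function name paste_lines and the sample lines are made up for illustration. One difference to be aware of: zip stops at the shorter input, so this sketch silently drops extra lines. That is another reason to confirm the line counts match first.

```python
def paste_lines(left, right, delimiter=","):
    """Join corresponding lines of two sequences with a delimiter."""
    return [
        a.rstrip("\n") + delimiter + b.rstrip("\n")
        for a, b in zip(left, right)
    ]

# Made-up lines standing in for the two files.
mixing = ["t1,0.25\n", "t2,0.20\n"]
energy = ["t1,3.0,900.0,0.0\n", "t2,4.0,850.0,0.1\n"]
for row in paste_lines(mixing, energy):
    print(row)
```

In each combined row, the first date/time comes from the mixing data and the second from the energy data, so the two should describe the same moment.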
The final task is to identify strong mixing events--times when the
percent change in temperature is dropping quickly--and then look at
the corresponding surface conditions.
The blend file contains all of the necessary data for all of the dates of interest, so we don't need to use grep or cut. Instead, we just need to cat the file and pipe it to a Python program (name it find_events.py) that does the analysis.
The Python program needs to remember the last hour's worth of mixing, wind, rain, and PAR values. That means you will need variables to hold each value. For example, you will want to initialize mix0, mix1, mix2, and mix3 to zero at the start of your main function.
In the main loop, check to see if the percent change has dropped by 5% or more in the last hour by testing if mix3 - mix0 > 0.05. If it has changed, then print out the date/time, the mix value, and the sum of the last hour's worth of wind, rain, and PAR.
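The drop test can be tried on a short list of numbers before wiring it into the full program. The sketch below covers only the mixing values and returns the positions where the test fires. The function name find_drops and the sample values are made up for illustration.

```python
def find_drops(mix_values, threshold=0.05):
    """Return indices where mix3 - mix0 exceeds the threshold."""
    mix0 = mix1 = mix2 = mix3 = 0.0
    hits = []
    for i, value in enumerate(mix_values):
        # Shift the history: mix3 is the oldest, mix0 the newest.
        mix3, mix2, mix1 = mix2, mix1, mix0
        mix0 = value
        if mix3 - mix0 > threshold:
            hits.append(i)
    return hits

# Made-up values: a slow decline followed by a sharp drop.
print(find_drops([0.30, 0.29, 0.28, 0.20, 0.19]))
```

Two things to watch for. First, for the first three rows mix3 is still the initial 0.0, so a negative mixing value early in the file can trigger a false hit. Second, a single drop usually fires on several consecutive rows, which matters when you count unique mixing events.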
Look at the output of your program and see if there are patterns that arise. What times of day and what conditions seem to occur when the temperature differences are decreasing quickly? How many unique mixing events occurred in the two months? Answer these questions explicitly in your write-up. Take a screenshot of the Terminal with any output that helps you to answer them. This is required image 2.
Each assignment will have a set of suggested extensions. The required tasks constitute about 85% of the assignment, and if you do only the required tasks and do them well you will earn a B+. To earn a higher grade, you need to undertake one or more extensions. The difficulty and quality of the extension or extensions will determine your final grade for the assignment. One complex extension, done well, or 2-3 simple extensions are typical.
- Compute other properties of the data. For example, compare the wind gust values with the wind speed values and see what the variation is. What is the largest wind gust in the data set? What is the largest percentage difference between wind speed and wind gust? What days saw the most rain? What days saw the most sun?
- As part of your writeup, show that you can get the same results using different types of patterns in grep. Use your writeup to teach yourself or your peers something about basic regular expressions.
- Instead of looking at PAR at just noon, try looking at multiple times of day and using the average. How different are your counts of sunny/cloudy days using different times or combinations of times?
- Try reformulating task 4 or task 6 to use lists instead of variables. For example, use a list wind instead of the variables wind0, wind1, wind2. If you do this for task 6, it should be easy to change the time range and evaluate conditions over the past 2 hours instead of just the last hour.
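For the list-based extension, one possible shape is a single list that keeps the newest value at the front and is trimmed to the window size. This is a sketch under that assumption, not the only way to do it. The function name window_sums and the sample values are made up for illustration.

```python
def window_sums(values, size):
    """Return the sum of the most recent `size` values after each new value."""
    recent = []
    sums = []
    for value in values:
        recent.insert(0, value)  # newest first, like wind0
        del recent[size:]        # drop anything older than the window
        sums.append(sum(recent))
    return sums

# Made-up readings with a window of three samples.
print(window_sums([1.0, 2.0, 3.0, 4.0, 5.0], 3))
```

Changing the time range then means changing only the size argument. With 15-minute data, a two-hour window would be eight samples.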
Write-up and Hand-in
Turn in your code by putting it into your private hand-in directory on the Courses server. All files should be organized in a folder titled "Project 2" and you should include only those files necessary to run the program. We will grade all files turned in, so please do not turn in old, non-working, versions of files.
Make a new wiki page for your assignment. Put the label cs151sf15project2 in the label field on the bottom of the page. But give the page a meaningful title (e.g. Milo's Project 2).
In general, your intended audience for your write-up is your peers not in the class. Your goal should be to be able to use it to explain to friends what you accomplished in this project and to give them a sense of how you did it. Follow the outline below.
- A brief summary of the task, in your own words. This should be no more than a few sentences. Give the reader context and identify the key purpose of the assignment.
- A description of your solution to the tasks, including any text output or images you created. This should be a description of the form and functionality of your final code. Note any unique computational solutions you developed or any insights you gained from your code's output. You may want to incorporate code snippets in your description to point out relevant features. Code snippets should be small segments of code--usually less than a whole function--that demonstrate a particular concept. If you find yourself including more than 5-10 lines of code, it's probably not a snippet.
- A description of any extensions you undertook, including text output or images demonstrating those extensions. If you added any modules, functions, or other design components, note their structure and the algorithms you used.
- A brief description (1-3 sentences) of what you learned. Think about the answer to this question in terms of the stated purpose of the project. What are some specific things you had to learn or discover in order to complete the project?
- A list of people you worked with, including TAs and professors. Include in that list anyone whose code you may have seen, such as those of friends who have taken the course in a previous semester.
- Double-check the label. When you created the page, you should have added the label cs151f15sproject2. Make sure it is there.