Project 2: Extracting information
The purpose of this project is to take the next step in working with data to try extracting some kind of information from patterns in the data. You will also make more use of functions and parameters to encapsulate code and build up a useful library of routines.
If you haven't already set yourself up for working on the project, then do so now.
- Mount your directory on the Personal server.
- Open the Terminal and navigate to your project2 directory on the Personal server.
- Open TextWrangler. If you want to look at any of the files you have already created, then open those files.
Download the following CSV file, which contains weather data from
the Maine Lakes Resource Center at five-minute intervals from
March 14th through July 19th, 2016. If you open the file, you'll
note that it has a lot of header lines giving details about the data.
This data contains 14 fields. For now, we're going to concern ourselves with field 8, PAR. PAR stands for Photosynthetically Active Radiation. Basically, it's a measure of how much visible sunlight is shining on the sensor.
The goal of step one in this project is to figure out how many cloudy days and how many sunny days occurred in May, June, and July of 2016 (Note, the weather station stopped collecting data on July 19th at 8:30am, so that will be the last entry in the data). We're going to identify cloudy days by looking at the PAR value at noon on each day of May, June, and July and seeing how many of those PAR measurements were above 800uE and how many were below.
The first sub-task is to create a grep command that gives all
of the lines of data from May, June and July. Looking at the
time and date field, it has the form
Note that the month and day fields and the hour field can each have either one or two digits. We want to find all of the date patterns that have a 5, 6, or 7 in the month field, any one or two digits in the day field, and that end in /16. Develop the appropriate grep command and test it on the data to make sure you are getting every line from May, June, and July 2016.
The form of your command will look like this:
grep -E <mayJuneJulyExpression> MLRC_19.csv
where <mayJuneJulyExpression> will be replaced by the actual expression (and don't forget to put it in quotes).
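If it helps to experiment with a candidate expression before running it through grep, the sketch below tries one possible pattern against a few made-up lines in Python. The sample lines and the exact date layout (month/day/year with no zero padding) are assumptions for illustration, not taken from the data file, and the pattern is only one of several that could work.

```python
import re

# Candidate pattern (an assumption, not the required answer):
# a 5, 6, or 7 that is not preceded by another digit, a slash,
# a one- or two-digit day, and then /16.
MAY_JUNE_JULY = re.compile(r"(^|[^0-9])[567]/[0-9]{1,2}/16")


def is_may_june_july(line):
    """Return True if the line contains a May, June, or July 2016 date."""
    return MAY_JUNE_JULY.search(line) is not None


if __name__ == "__main__":
    # Hypothetical sample lines; the real file has 14 fields per line.
    samples = [
        "1,5/1/16 12:00,3.2",
        "2,7/19/16 8:30,1.0",
        "3,3/14/16 0:05,0.0",
        "4,4/5/16 12:00,0.0",
    ]
    for line in samples:
        print(is_may_june_july(line), line)
```

The pattern uses only features that extended regular expressions also provide, so the same expression, in quotes, should be usable with grep -E.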
The second sub-task is to extract the PAR value for just those lines that
occur at noon. This is broken into two sub-sub-tasks:
- Find all of the lines containing the string 12:00. Pipe the output of the
grep command from step one to a grep command that extracts all of the
noon lines. Your command should have the form:
grep -E <mayJuneJulyExpression> MLRC_19.csv | grep <noonExpression>
where <noonExpression> will be replaced by the expression to extract only those lines with 12:00 in them (don't forget your quotes). Test this step.
Pipe the output to a cut command that extracts field 8
delimited by commas. Test that this result gives you just the
PAR column. It should give you 79 PAR values. Pipe the
output of the cut command to wc if you want to
get a line/word/character count (wc is a really useful tool
for counting lines/words/characters in a file). Your terminal
command for doing that should look something like below.
grep -E <mayJuneJulyExpression> MLRC_19.csv | grep <noonExpression> | cut <arguments> | wc
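To see what the cut step is doing, here is a small Python sketch of extracting one comma-delimited field from a line. The function names and the sample line are hypothetical. Note that cut counts fields starting at 1, while Python lists start at 0.

```python
def cut_field(line, field, delimiter=","):
    """Return the 1-indexed field of a delimited line, similar to cut -d, -f<field>."""
    return line.rstrip("\n").split(delimiter)[field - 1]


def par_column(lines, field=8):
    """Return the chosen field from every line as a list of strings."""
    return [cut_field(line, field) for line in lines]


if __name__ == "__main__":
    # Hypothetical noon line; only the position of field 8 matters here.
    sample = "1,5/1/16 12:00,a,b,c,d,e,1234.5,x\n"
    print(cut_field(sample, 8))
    # The length of the list plays the role of the line count from wc.
    print(len(par_column([sample, sample])))
```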
The third sub-task is to write a Python program that reads the stream of
PAR data and classifies each value as either a cloudy day or a sunny
day. Use 800uE as the threshold for sunny/cloudy. A value less than
800 means the day is pretty overcast. Your program should also sum
all of the PAR values for sunny days and sum the PAR values for cloudy
days and then compute and print the average for each, along with the
total number of each type of day. There should be 58 sunny days and
21 cloudy days at noon in May, June, and July 2016 (through July 19).
import sys

def main(stdin):
    # assign to nSun the value 0
    # assign to sumSun the value 0
    # assign to nCloud the value 0
    # assign to sumCloud the value 0

    # assign to buf the result of calling readline using the stdin parameter
    while buf.strip() != '':
        # assign to par the result of casting buf to a float
        # if the par value is greater than 800
            # increment nSun by 1
            # increment sumSun by the value of par
        # else
            # increment nCloud by 1
            # increment sumCloud by the value of par
        # assign to buf the result of calling readline using the stdin parameter

    # print out the number of sunny days and the average par of sunny days
    # print out the number of cloudy days and the average par of cloudy days

if __name__ == "__main__":
    main(sys.stdin)
Take a screenshot of the Terminal with the output. You should include this in your wiki page write-up as an image to demonstrate the correctness of your code. This is required image 1.
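The counting and averaging logic can be checked separately from the stdin loop. The sketch below is one possible way to do that, working on a plain list of numbers. The function name, the return order, and the choice to report an average of 0.0 when a category is empty are assumptions for illustration, and the sample values are made up.

```python
def summarize_par(values, threshold=800.0):
    """Count sunny and cloudy readings and average the PAR value of each group.

    A reading strictly greater than the threshold counts as sunny;
    everything else counts as cloudy.
    """
    n_sun = 0
    sum_sun = 0.0
    n_cloud = 0
    sum_cloud = 0.0
    for par in values:
        if par > threshold:
            n_sun += 1
            sum_sun += par
        else:
            n_cloud += 1
            sum_cloud += par
    # Guard against dividing by zero when a group is empty (an assumption).
    avg_sun = sum_sun / n_sun if n_sun else 0.0
    avg_cloud = sum_cloud / n_cloud if n_cloud else 0.0
    return n_sun, avg_sun, n_cloud, avg_cloud


if __name__ == "__main__":
    # Made-up readings, not values from the data file.
    print(summarize_par([1000.0, 1200.0, 400.0]))
```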
The second main task is to extract the percentage difference between
the temperature at 1m and the temperature at 7m from the live buoy
data (that is the data you need to curl) for May and June
of 2016. Use the LEA buoy data at the link below for this task.
We can break this down into three sub-tasks. The first is to identify all the data lines from May and June, similar to the first sub-task above. The second is to cut out the date/time, 1m, and 7m fields (fields 1, 9, and 18). The third is to read in each line, extract the 2nd and 3rd fields, convert them to floats and calculate the percent difference.
- Write a grep command that identifies all of the live buoy data values from May and June. You can test it by piping the output to less.
- Pipe the grep command to a cut command that extracts the date/time (field 1), 1m temperature (field 9), and 7m temperature (field 18) fields. Test that you are getting a result with three fields and that the fields have the appropriate values (hint: you can open the .csv file in TextWrangler or Excel to determine which values you should expect).
Write a Python program (name it mixing.py) that
reads lines from stdin. Use the split
function to divide a line by commas. Then cast the second and
third words to floats using the float function.
These are the 1m and 7m temperatures. Compute the percent change
C = (1m - 7m)/7m, then print out the date/time and percent
change. Here are comments to guide you as you write the code:
import sys

def main(stdin):
    # assign to buf the result of calling readline using the stdin parameter
    while buf.strip() != '':
        # assign to words the result of calling buf.split(',')
        # assign to temp1m the result of casting words[1] to a float
        # assign to temp7m the result of casting words[2] to a float
        # assign to change the expression (temp1m - temp7m)/temp7m
        # print out words[0] and change, separated by a comma
        # assign to buf the result of calling readline using the stdin parameter

if __name__ == "__main__":
    main(sys.stdin)
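The per-line computation can be tried on its own before wiring it into the readline loop. This is a minimal sketch, assuming each input line has exactly the three comma-separated fields produced by the cut step (date/time, 1m temperature, 7m temperature). The function name and the sample line are hypothetical.

```python
def percent_change(line):
    """Return (date/time, (temp1m - temp7m) / temp7m) for one line of input."""
    words = line.strip().split(",")
    temp1m = float(words[1])
    temp7m = float(words[2])
    change = (temp1m - temp7m) / temp7m
    return words[0], change


if __name__ == "__main__":
    # Made-up temperatures chosen so the arithmetic is easy to check by hand.
    stamp, change = percent_change("5/1/16 12:00,15.0,10.0\n")
    print(stamp + "," + str(change))
```

Note that the result is a fraction (0.5 means a 50% difference), which matches the 0.05 threshold used for a 5% drop later in the project.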
When you run the program, redirect the output to the file
mixing.csv. You can do this using the > symbol.
<other commands> | python mixing.py > mixing.csv
Our goal for this project is to examine the surface conditions--wind
speed, gust speed, and sunshine--when the difference in
temperatures between the 1m and 7m measurements drops. The wind speed
and gust speed are available as 5min measurements in the Maine
Lakes Resource Center file (that is the .csv file you downloaded and used in
task 1), while the buoy data (the data you curl) is
available at 15min intervals.
For this task, we're going to modify the wind speed, gust speed, and PAR to find the 15min averages from the Maine Lakes Resource Center file.
There are three steps to this process. The first is to extract all of the May and June data, the second is to cut out the four fields we want (date/time, wind speed, wind gusts, and PAR), and the third is to pass it to a Python program that averages three readings to generate and print out a single reading every 15min.
- Generate a grep command that extracts all of the lines from May and June from the MLRC data. This will be identical to the line you used in the previous task. Test it again.
- Generate a cut command that extracts the date/time (field 2), Wind Speed (field 4), Gust Speed (field 5), and PAR (field 8). Pipe the output of grep from the previous step to the cut command and test the result.
Write a Python program (name it energy.py) that
reads through the data, extracts the values from each line, and
remembers the last three wind, gust, and PAR values. If the
line is on a 15 min interval, then print out the average of the
last three wind, gust, and PAR values.
import sys

def main(stdin):
    # assign to wind0 the value 0.0
    # assign to wind1 the value 0.0
    # assign to wind2 the value 0.0
    # assign to gust0 the value 0.0
    # assign to gust1 the value 0.0
    # assign to gust2 the value 0.0
    # assign to par0 the value 0.0
    # assign to par1 the value 0.0
    # assign to par2 the value 0.0
    # assign to datetime the empty string ''

    # assign to buf the result of calling readline using the stdin parameter
    while buf.strip() != '':
        # assign to wind2 the value in wind1
        # assign to wind1 the value in wind0
        # assign to gust2 the value in gust1
        # assign to gust1 the value in gust0
        # assign to par2 the value in par1
        # assign to par1 the value in par0

        # assign to words the result of calling split on the buf variable with a comma as argument
        # assign to datetime the value in words[0]
        # assign to wind0 the result of casting words[1] to a float
        # assign to gust0 the result of casting words[2] to a float
        # assign to par0 the result of casting words[3] to a float

        # if any of the strings ":00:", ":15:", ":30:", or ":45:" are in the datetime string
            # assign to avgwind the average of wind0, wind1, and wind2
            # assign to avggust the average of gust0, gust1, and gust2
            # assign to avgpar the average of par0, par1, and par2
            # print the datetime, average wind, average gust, and average PAR separated by commas

        # assign to buf the result of calling readline using the stdin parameter

if __name__ == "__main__":
    main(sys.stdin)
Remember to add the appropriate comments to the top of your code file (including how to run the program). Test the program and make sure it is printing out values only on 15min intervals. Redirect the output to a file called energy.csv.
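As a way to check your averages, the same "remember the last three readings" idea can be sketched with a fixed-length queue instead of nine separate variables. This is an alternative formulation, not the structure the task asks for. The function name and sample lines are hypothetical, and the timestamps are assumed to include seconds so that the ":15:" style markers appear in them.

```python
from collections import deque

# Substrings that mark a reading taken on a 15 minute boundary.
MARKS = (":00:", ":15:", ":30:", ":45:")


def fifteen_minute_averages(lines):
    """Return (datetime, avg wind, avg gust, avg PAR) for each 15 minute line."""
    # Start with three zero readings, mirroring variables initialized to 0.0.
    recent = deque([(0.0, 0.0, 0.0)] * 3, maxlen=3)
    results = []
    for line in lines:
        words = line.strip().split(",")
        stamp = words[0]
        # Appending to a full deque drops the oldest reading automatically.
        recent.append((float(words[1]), float(words[2]), float(words[3])))
        if any(mark in stamp for mark in MARKS):
            wind, gust, par = (sum(column) / 3.0 for column in zip(*recent))
            results.append((stamp, wind, gust, par))
    return results


if __name__ == "__main__":
    # Made-up readings at 5 minute spacing.
    sample = [
        "5/1/16 12:05:00,1.0,2.0,300.0",
        "5/1/16 12:10:00,2.0,4.0,600.0",
        "5/1/16 12:15:00,3.0,6.0,900.0",
    ]
    for row in fifteen_minute_averages(sample):
        print(",".join(str(value) for value in row))
```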
After the last two tasks, you should have files called mixing.csv and
energy.csv. They should have the same number of lines
(use wc to check this).
Use the Unix tool paste to combine the two files together, using a comma as the delimiter. Redirect the output to the file blend.csv. Check that the two date/time fields line up properly in the output file.
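For intuition about what paste produces, the sketch below joins corresponding lines of two files with a comma. The function name and sample rows are hypothetical. One difference to keep in mind: zip stops at the shorter input, so this sketch only behaves like paste when both files have the same number of lines.

```python
def paste_lines(left_lines, right_lines, delimiter=","):
    """Join line i of the first input to line i of the second with a delimiter."""
    return [
        left.rstrip("\n") + delimiter + right.rstrip("\n")
        for left, right in zip(left_lines, right_lines)
    ]


if __name__ == "__main__":
    # Made-up rows standing in for mixing.csv and energy.csv.
    mixing = ["t1,0.50\n", "t2,0.49\n"]
    energy = ["t1,1.0,2.0,100.0\n", "t2,1.5,2.5,150.0\n"]
    for row in paste_lines(mixing, energy):
        print(row)
```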
The final task is to identify strong mixing events--times when the
percent change in temperature is dropping quickly--and then look at
the corresponding surface conditions.
The blend file contains all of the necessary data for all of the dates of interest, so we don't need to use grep or cut. Instead, we just need to cat the file and pipe it to a Python program (name it find_events.py) that does the analysis.
The Python program needs to remember the last hour's worth of mixing, wind, gust, and PAR values. That means you will need variables to hold each value. For example, you will want to initialize mix0, mix1, mix2, and mix3 to zero at the start of your main function.
In the main loop, check to see if the percent change has dropped by 5% or more in the last hour by testing if mix3 - mix0 > 0.05. If it has changed, then print out the date/time, the mix value, and the sum of the last hour's worth of wind, gust, and PAR.
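The event test can be sketched with a four-entry window in place of the mix0 through mix3 variables. This is one possible formulation, assuming blend.csv has six comma-separated fields per line (date/time and percent change from mixing.csv, then date/time, wind, gust, and PAR from energy.csv). The function name, parameters, and sample rows are hypothetical.

```python
from collections import deque


def find_events(lines, drop=0.05, window=4):
    """Return (datetime, mix, wind sum, gust sum, PAR sum) for each mixing event."""
    # Four readings at 15 minute spacing cover the last hour; start with zeros.
    history = deque([(0.0, 0.0, 0.0, 0.0)] * window, maxlen=window)
    events = []
    for line in lines:
        words = line.strip().split(",")
        stamp = words[0]
        mix = float(words[1])
        wind, gust, par = float(words[3]), float(words[4]), float(words[5])
        history.append((mix, wind, gust, par))
        oldest_mix = history[0][0]  # plays the role of mix3
        if oldest_mix - mix > drop:  # same test as mix3 - mix0 > 0.05
            wind_sum = sum(row[1] for row in history)
            gust_sum = sum(row[2] for row in history)
            par_sum = sum(row[3] for row in history)
            events.append((stamp, mix, wind_sum, gust_sum, par_sum))
    return events


if __name__ == "__main__":
    # Made-up rows; the last one is 0.10 below the value an hour earlier.
    sample = [
        "t1,0.50,t1,1.0,2.0,100.0",
        "t2,0.49,t2,1.0,2.0,100.0",
        "t3,0.48,t3,1.0,2.0,100.0",
        "t4,0.40,t4,1.0,2.0,100.0",
    ]
    for event in find_events(sample):
        print(",".join(str(value) for value in event))
```

Because the window length is a parameter here, looking back over two hours would only require a larger window value.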
Look at the output of your program and see if there are patterns that arise. What times of day and what conditions seem to occur when the temperature differences are decreasing quickly? How many unique mixing events occurred in the two months? Answer these questions explicitly in your write-up. Take a screenshot of the Terminal with any output that helps you to answer them. This is required image 2.
Each assignment will have a set of suggested extensions. The required tasks constitute about 85% of the assignment, and if you do only the required tasks and do them well you will earn a B+. To earn a higher grade, you need to undertake one or more extensions. The difficulty and quality of the extension or extensions will determine your final grade for the assignment. One complex extension, done well, or 2-3 simple extensions are typical.
- Compute other properties of the data. For example, compare the wind gust values with the wind speed values and see what the variation is. What is the largest wind gust in the data set? What is the largest percentage difference between wind speed and wind gust? What days saw the most wind gust? What days saw the most sun?
- As part of your writeup, show that you can get the same results using different types of patterns in grep. Use your writeup to teach yourself or your peers something about basic regular expressions.
- Instead of looking at PAR at just noon, try looking at multiple times of day and using the average. How different are your counts of sunny/cloudy days using different times or combinations of times?
- Try reformulating task 4 or task 6 to use lists instead of variables. For example, use a list wind instead of the variables wind0, wind1, wind2. If you do this for task 6, it should be easy to change the time range and evaluate conditions over the past 2 hours instead of just the last hour.
Write-up and Hand-in
Turn in your code by putting it into your private hand-in directory on the Courses server. All files should be organized in a folder titled project2 and you should include only those files necessary to run the program. We will grade all files turned in, so please do not turn in old, non-working, versions of files.
Make a new wiki page for your assignment. Put the label cs152f16project2 in the label field on the bottom of the page. But give the page a meaningful title (e.g. Milo's Project 2).
In general, your intended audience for your write-up is your peers not in the class. Your goal should be to be able to use it to explain to friends what you accomplished in this project and to give them a sense of how you did it. Follow the outline below.
- A brief summary of the task, in your own words. This should be no more than a few sentences. Give the reader context and identify the key purpose of the assignment.
- A description of your solution to the tasks, including any text output or images you created. This should be a description of the form and functionality of your final code. Note any unique computational solutions you developed or any insights you gained from your code's output. You may want to incorporate code snippets in your description to point out relevant features. Code snippets should be small segments of code--usually less than a whole function--that demonstrate a particular concept. If you find yourself including more than 5-10 lines of code, it's probably not a snippet.
- A description of any extensions you undertook, including text output or images demonstrating those extensions. If you added any modules, functions, or other design components, note their structure and the algorithms you used.
- A brief description (1-3 sentences) of what you learned. Think about the answer to this question in terms of the stated purpose of the project. What are some specific things you had to learn or discover in order to complete the project?
- A list of people you worked with, including TAs and professors. Include in that list anyone whose code you may have seen, such as those of friends who have taken the course in a previous semester.
- Double-check the label. When you created the page, you should have added the label cs152f16project2. Make sure it is there.