Due: , 11:59 pm
The purpose of this project is to take the next step in working with data to try extracting some kind of information from patterns in the data. You will also make more use of functions and parameters to encapsulate code and build up a useful library of routines.
If you haven't already set yourself up for working on the project, then do so now:
Download the following CSV file, which contains weather data from the Maine Lakes Resource Center at five-minute intervals from March 14th through July 19th, 2016. If you open the file, you'll note that it has a lot of header lines giving details about the data itself.
This data contains 14 fields. For now, we're going to concern ourselves with field 8, PAR. PAR stands for Photosynthetically Active Radiation. Basically, it's a measure of how much visible sunlight is shining on the sensor.
The goal of this step is to figure out how many cloudy days and how many sunny days occurred in June and July of 2016. (Note: the weather station stopped collecting data on July 19th at 8:30am, so that will be the last entry in the data.) We're going to identify cloudy days by looking at the PAR value at noon on each day of June and July and seeing how many of those PAR measurements were above 800 uE and how many were below.
The first sub-task is to create a grep command that gives all of the lines of data from June and July. Looking at the time and date field, it has the form
Note that the month, day, and hour fields can each have either one or two digits. We want to find all of the date patterns that have a 6 or 7 in the month field, any one or two digits in the day field, and that end in /16. Develop the appropriate grep command and test it on the data to make sure you are getting every line from June and July 2016.
The form of your command will look like this:
grep <juneJulyExpression> MLRC_19.csv
where <juneJulyExpression> will be replaced by the actual expression (and don't forget to put it in quotes).
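If you want to experiment with the pattern before committing to a grep expression, the same idea can be sketched with Python's re module. This is only an illustration: the name is_june_or_july and the sample date strings are hypothetical, and grep's expression syntax differs from Python's in places (for example, \d and \b are not available in every grep), so treat it as a way to test your reasoning rather than as the grep expression itself.

```python
import re

# Assumed pattern: month 6 or 7, a one- or two-digit day, and the year 16.
# \b stops the month from matching the tail of a two-digit number such as 16.
JUNE_JULY = re.compile(r"\b[67]/\d{1,2}/16\b")


def is_june_or_july(line):
    """Return True if the line contains a June or July 2016 date."""
    return JUNE_JULY.search(line) is not None


if __name__ == "__main__":
    # Hypothetical date strings, not copied from MLRC_19.csv.
    for sample in ["6/1/16 12:00", "7/19/16 8:30", "5/6/16 12:00", "3/16/16 0:05"]:
        print(sample, is_june_or_july(sample))
```

The last two samples are the interesting ones: a 6 in the day field, or a 16 in the day field, should not be mistaken for a June date.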
The second sub-task is to extract the PAR value for just those lines that occur at noon. This is broken into two sub-sub-tasks:
Find all of the lines containing the string 12:00. Pipe the output of the grep command from step one to a grep command that extracts all of the noon lines. Your command should have the form:
grep <juneJulyExpression> MLRC_19.csv | grep <noonExpression>
where <noonExpression> will be replaced by the expression to extract only those lines with 12:00 in them (don't forget your quotes). Test this step.
Pipe the output to a cut command that extracts field 8 delimited by commas. Test that this result gives you just the PAR column. It should give you 48 PAR values. Pipe the output of the cut command to wc if you want to get a line/word/character count (wc is a really useful tool for counting lines/words/characters in a file). Your terminal command for doing that should look something like below.
grep <juneJulyExpression> MLRC_19.csv | grep <noonExpression> | cut <arguments> | wc
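To see what the noon grep and the cut are doing to each line, here is a rough Python equivalent. It is a sketch under assumptions: the function name noon_par_values is made up, and the sample lines in the usage note are hypothetical 14-field rows rather than real station data.

```python
def noon_par_values(lines):
    """Mimic the noon grep followed by a cut on comma-delimited field 8."""
    values = []
    for line in lines:
        if "12:00" in line:                      # the noon filter
            fields = line.rstrip("\n").split(",")
            values.append(fields[7])             # field 8 is index 7 in Python
    return values


if __name__ == "__main__":
    # Hypothetical rows: 14 comma-separated fields, PAR in field 8.
    rows = [
        "1,6/1/16 12:00:00,0,0,0,0,0,1234.5,0,0,0,0,0,0\n",
        "2,6/1/16 12:05:00,0,0,0,0,0,999.0,0,0,0,0,0,0\n",
    ]
    print(noon_par_values(rows))
```

Note the off-by-one difference: cut counts fields starting at 1, while Python list indices start at 0.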
The third sub-task is to write a Python program (sunny.py) that reads the stream of PAR data and classifies each value as either a cloudy day or a sunny day. Use 800 uE as the threshold for sunny/cloudy. A value less than 800 means the day is pretty overcast. Your program should also sum all of the PAR values for sunny days, sum the PAR values for cloudy days, and then compute and print the average for each, along with the total number of each type of day. There should be 37 sunny days and 11 cloudy days at noon in June and July 2016 (through July 19). Format your print statements so that they are well organized with clear labels.
# Your Name
# Today's Date
# Course and Project Information
#
# Name of the file
# Purpose of the file
#
# Example Unix command for running the file
# (This should include the grepping, cutting, and piping.)

import sys

def main(stdin):

    # assign to nSun the value 0
    # assign to sumSun the value 0
    # assign to nCloud the value 0
    # assign to sumCloud the value 0

    # assign to buf the result of calling readline using the stdin parameter

    while buf.strip() != '':
        # assign to par the result of casting buf to a float
        # if the par value is greater than 800
            # increment nSun by 1
            # increment sumSun by the value of par
        # else
            # increment nCloud by 1
            # increment sumCloud by the value of par

        # assign to buf the result of calling readline using the stdin parameter

    # print out the number of sunny days and the average par of sunny days
    # print out the number of cloudy days and the average par of cloudy days

if __name__ == "__main__":
    main(sys.stdin)
Notice the comments at the beginning of the file. You should replace this text with the relevant information for this particular file.
If you need to use Unix commands to run the file (as you do here!), then be sure to include an example of the Unix command in the comment header.
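One way to check the arithmetic in sunny.py is to run the same classification on a short list you can verify by hand. The sketch below is not the required program: it works on a list instead of stdin, the name summarize_par is hypothetical, and the guard against an empty category is an added assumption that the skeleton above does not ask for.

```python
def summarize_par(values, threshold=800.0):
    """Split PAR readings into sunny (> threshold) and cloudy, with averages."""
    n_sun = n_cloud = 0
    sum_sun = sum_cloud = 0.0
    for par in values:
        if par > threshold:
            n_sun += 1
            sum_sun += par
        else:
            n_cloud += 1
            sum_cloud += par
    # Guard against dividing by zero when a category is empty.
    avg_sun = sum_sun / n_sun if n_sun else 0.0
    avg_cloud = sum_cloud / n_cloud if n_cloud else 0.0
    return n_sun, avg_sun, n_cloud, avg_cloud


if __name__ == "__main__":
    # Hypothetical readings: two sunny values and one cloudy value.
    print(summarize_par([1000.0, 1200.0, 400.0]))
```

With the three hypothetical readings, the sunny average is (1000 + 1200) / 2 = 1100 and the cloudy average is 400, which is easy to confirm on paper.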
Take a screenshot of the Terminal with the output. You should include this in your wiki page write-up as an image to demonstrate the correctness of your code. (As usual, please do not include much if any of your code in images for your report--only a couple of lines, tops! In general, wiki pages are public, and your code should not be made public.) This is required image 1.
The second main task is to extract the percentage difference between the temperature at 1 m and the temperature at 7 m from the live buoy data (that is, the data for which you used curl in lab) for June of 2016. Use the buoy data at the link below for this task.
Note that the buoy data in this file stop after June 25. In the exercises below, we'll restrict ourselves to all of the available data from June--that is, up through and including June 25, but not beyond.
We can break our main task down into three sub-tasks. The first is to identify all the data lines from June, similar to the first sub-task above. The second is to cut out the date/time, 1 m, and 7 m fields (fields 1, 11, and 18). The third is to read in each line, extract the temperature fields, convert them to floats, and calculate the percent difference.
Write a grep command that identifies all of the buoy data values from June. You can test it by piping the output to less.
Pipe the grep command to a cut command that extracts the date/time (field 1), 1 m temperature (field 11), and 7 m temperature (field 18) fields. Test that you are getting a result with three fields and that the fields have the appropriate values (hint: you can open the .csv file in TextWrangler or Excel to determine which values you should expect).
Write a Python program (name it mixing.py) that reads lines from stdin. Use the split function to divide a line by commas. Then cast the second and third words to floats using the float function. These are the 1 m and 7 m temperatures. Compute the percent change C = (1m_value - 7m_value) / 7m_value, then print out the date-time and percent change. Here are comments to guide you as you write the code:
# Your Name
# Today's Date
# Course and Project Information
#
# Name of the file
# Purpose of the file
#
# Example Unix command for running the file
# (This should include the grepping, cutting, and piping.)

import sys

def main(stdin):

    # assign to buf the result of calling readline using the stdin parameter

    while buf.strip() != '':
        # assign to words the result of calling buf.split(',')
        # assign to temp1m the result of casting words[1] to a float
        # assign to temp7m the result of casting words[2] to a float
        # assign to change the expression (temp1m - temp7m)/temp7m
        # print out words[0] and change, separated by a comma

        # assign to buf the result of calling readline using the stdin parameter

if __name__ == "__main__":
    main(sys.stdin)
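The percent-change formula is easy to check on a single line before running it over the whole file. The following is a minimal sketch, assuming the input line already has the three comma-separated fields produced by cut (date/time, 1 m temperature, 7 m temperature). The names percent_change and format_line and the sample line are hypothetical.

```python
def percent_change(temp1m, temp7m):
    """Return C = (temp1m - temp7m) / temp7m as a fraction, not a percentage."""
    return (temp1m - temp7m) / temp7m


def format_line(line):
    """Turn 'datetime,temp1m,temp7m' into 'datetime,change'."""
    words = line.strip().split(",")
    change = percent_change(float(words[1]), float(words[2]))
    return words[0] + "," + str(change)


if __name__ == "__main__":
    # Hypothetical input line: 20.0 degrees at 1 m, 16.0 degrees at 7 m.
    print(format_line("6/1/16 0:00,20.0,16.0\n"))
```

Because C is a fraction, a value of 0.25 means the 1 m temperature is 25% higher than the 7 m temperature. Keep that in mind later when a threshold such as 0.05 is used to mean 5%.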
When you run the program, redirect the output to the file mixing.csv. You can do this using the > symbol:
<other commands> | python3 mixing.py > mixing.csv
Be sure to include this command in the header comments for your code file.
Our goal for this task is to examine the surface conditions -- wind speed, gust speed, and sunshine -- when the difference in temperatures between the 1 m and 7 m measurements drops. The wind speed and gust speed are available in a Maine Lakes Resource Center file (which you can download).
These data are present at 5 minute intervals--this is different from the buoy data (the data for which you use curl, above), which is present at 15 minute intervals.
For this task, we're going to modify the wind speed, gust speed, and PAR to find the 15 minute averages from our new Maine Lakes Resource Center file. Note that, like the buoy data file, the data in MLRC_19_a do not extend beyond June 25, so we will once again consider all June data that are available--that is, up through and including June 25, but not beyond.
There are three steps to this process. The first is to extract all of the June data, the second is to cut out the four fields we want (date/time, wind speed, wind gusts, and PAR), and the third is to pass it to a Python program that averages three readings to generate and print out a single reading every 15 minutes.
Generate a grep command that extracts all of the lines from June from this MLRC data. Look back to the grep command you used in the previous task for inspiration. (Test the command to make sure it works.)
Generate a cut command that extracts the date/time (field 2), Wind Speed (field 4), Gust Speed (field 5), and PAR (field 8). Pipe the output of grep from the previous step to the cut command and test to make sure the results are correct.
Write a Python program (name it energy.py) that reads through the data, extracts the values from each line, and remembers the last three wind, gust, and PAR values. If the line is on a 15 min interval, then print out the average of the last three wind, gust, and PAR values.
# Your Name
# Today's Date
# Course and Project Information
#
# Name of the file
# Purpose of the file
#
# Example Unix command for running the file
# (This should include the grepping, cutting, and piping.)

import sys

def main(stdin):

    # assign to wind0 the value 0.0
    # assign to wind1 the value 0.0
    # assign to wind2 the value 0.0

    # assign to gust0 the value 0.0
    # assign to gust1 the value 0.0
    # assign to gust2 the value 0.0

    # assign to par0 the value 0.0
    # assign to par1 the value 0.0
    # assign to par2 the value 0.0

    # assign to datetime the empty string ''

    # assign to buf the result of calling readline using the stdin parameter

    while buf.strip() != '':
        # assign to wind2 the value in wind1
        # assign to wind1 the value in wind0

        # assign to gust2 the value in gust1
        # assign to gust1 the value in gust0

        # assign to par2 the value in par1
        # assign to par1 the value in par0

        # assign to words the result of calling split on the buf variable with a comma as argument

        # assign to datetime the value in words[0]
        # assign to wind0 the result of casting words[1] to a float
        # assign to gust0 the result of casting words[2] to a float
        # assign to par0 the result of casting words[3] to a float

        # if any of the strings ":00:", ":15:", ":30:", or ":45:" are in the datetime string
            # assign to avgwind the average of wind0, wind1, and wind2
            # assign to avggust the average of gust0, gust1, and gust2
            # assign to avgpar the average of par0, par1, and par2
            # print the datetime, average wind, average gust, and average PAR separated by commas

        # assign to buf the result of calling readline using the stdin parameter

if __name__ == "__main__":
    main(sys.stdin)
Remember to update the comments at the top of your code file to say what this file's purpose is and how to run it. Test the program and make sure it is printing out values only on 15 min intervals. Redirect the output to a file called energy.csv.
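The "remember the last three values" idea can also be expressed with a fixed-length queue, which may help you reason about what the nine shifting variables are doing. This sketch swaps in collections.deque as an alternative technique; it is not the approach the skeleton above uses. It also tracks a single value per line rather than wind, gust, and PAR, and the function name and the date/time strings are hypothetical.

```python
from collections import deque


def fifteen_minute_averages(rows):
    """rows: iterable of (datetime_string, value) pairs at 5-minute spacing."""
    recent = deque(maxlen=3)   # automatically keeps only the three newest values
    out = []
    for stamp, value in rows:
        recent.append(value)
        # Report only on the quarter-hour marks.
        if any(mark in stamp for mark in (":00:", ":15:", ":30:", ":45:")):
            out.append((stamp, sum(recent) / len(recent)))
    return out


if __name__ == "__main__":
    # Hypothetical readings in an assumed H:MM:SS time format.
    sample = [
        ("6/1/16 0:00:00", 3.0),
        ("6/1/16 0:05:00", 1.0),
        ("6/1/16 0:10:00", 2.0),
        ("6/1/16 0:15:00", 6.0),
    ]
    print(fifteen_minute_averages(sample))
```

One difference to be aware of: the skeleton starts every variable at 0.0, so its first average includes two zeros, whereas this version divides by however many readings it has seen so far. The two agree from the third reading onward.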
After the last two tasks, you should have files called mixing.csv and energy.csv. They should each have 2400 lines (use wc to check this).
Use the Unix tool paste to combine the two files together, using a comma as the delimiter. Redirect the output to the file blend.csv. Check that the two date/time fields line up properly in the output file.
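If it is unclear what paste produces, the following Python sketch imitates it for two lists of lines. The function name paste_with_commas and the sample lines are hypothetical, and this is only an approximation of the real tool.

```python
def paste_with_commas(left_lines, right_lines):
    """Join corresponding lines from two inputs with a comma, similar to paste."""
    return [
        a.rstrip("\n") + "," + b.rstrip("\n")
        for a, b in zip(left_lines, right_lines)
    ]


if __name__ == "__main__":
    # Hypothetical one-line stand-ins for mixing.csv and energy.csv.
    mixing = ["6/1/16 0:00,0.25\n"]
    energy = ["6/1/16 0:00:00,1.5,2.5,0.0\n"]
    print(paste_with_commas(mixing, energy))
```

Since mixing.csv contributes two fields and energy.csv contributes four, the two date/time values end up in fields 1 and 3 of each blended line, which is where to look when checking that they line up. Unlike this sketch, which stops at the end of the shorter input, paste keeps going when one file is longer, so a line-count mismatch shows up as rows with empty fields.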
The final task is to identify strong mixing events--times when the percent change in temperature is dropping quickly--and then look at the corresponding surface conditions.
The blend file contains all of the necessary data for all of the dates of interest, so we don't need to use grep or cut. Instead, we just need to cat the file (i.e., type cat blend.csv on the command line) and pipe the output to a Python program (name it find_events.py) that does the analysis.
The Python program needs to remember the last hour's worth of mixing, wind, gust, and PAR values. That means you will need variables to hold each value. For example, you will want to initialize mix0, mix1, mix2, and mix3 to zero at the start of your main function.
In the main loop, check to see if the percent change has dropped by 5% or more in the last hour by testing if mix3 - mix0 > 0.05. If it has changed, then print out the date/time, the mix3 value, and the sum of the last hour's worth of wind, gust, and PAR.
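The shifting-window test can be tried on a handful of numbers before it is wired into find_events.py. This is a sketch of the mixing check only, under assumptions: the function name find_drops and the sample values are hypothetical, mix0 is taken to be the newest reading and mix3 the oldest, and the wind, gust, and PAR sums are left out.

```python
def find_drops(mix_values, threshold=0.05):
    """Return the indices where mix3 - mix0 exceeds the threshold."""
    mix0 = mix1 = mix2 = mix3 = 0.0
    hits = []
    for i, value in enumerate(mix_values):
        # Shift the window: the oldest value falls off, the newest enters at mix0.
        mix3, mix2, mix1, mix0 = mix2, mix1, mix0, value
        if mix3 - mix0 > threshold:
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Hypothetical percent-change values that fall sharply at the fourth reading.
    print(find_drops([0.30, 0.29, 0.28, 0.22, 0.21]))
```

Because the window starts out filled with zeros, the first three readings are compared against 0.0 rather than against real data. That is harmless for positive values, but a negative percent change early in the file could register as a drop.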
Look at the output of your program and see if there are patterns that arise. What times of day and what conditions seem to occur when the temperature differences are decreasing quickly? How many unique mixing events occurred in the data set? Answer these questions explicitly in your write-up. Take a screenshot of the Terminal with any output that helps you to answer them. This is required image 2.
Each assignment will have a set of suggested extensions. The required tasks and write-up constitute about 86% of the assignment. To earn a higher grade than that, you need to undertake one or more extensions. The difficulty and quality of the extension or extensions will determine your final grade for the assignment. One complex extension, done well, or 2-3 simple extensions are typical.
These are only examples to help you start thinking of the unlimited possible ways you could extend the project. You are strongly encouraged to design your own extensions to suit your interests and show off your computational thinking skills.
Whichever extensions you choose, be sure to discuss your motivation, design process, implementation, and results in the writeup. A screenshot of your results is usually a great idea.
Turn in your code by putting it into your private hand-in directory on the Courses server. All files should be organized in a folder titled project2 and you should include only those files necessary to run the program. We will grade all files turned in, so please do not turn in old, non-working, versions of files.
Make a new wiki page for your assignment. Put the label cs152s19project2 in the label field on the bottom of the page. But give the page a meaningful title (e.g. Eric's CS152
In general, your intended audience for your write-up is your peers in CS152 (i.e. students who know the same amount of Python but are not doing the same projects). Your goal should be to be able to use it to explain to friends what you accomplished in this project and to give them a sense of how you did it. Follow the outline below.
Before you finish, check that the label cs152s19project2 is on your wiki page.
© 2019 Eric Aaron (with contributions from Colby CS colleagues).