CS 152: Project 2

Spring 2017

Project 2: Extracting information

The purpose of this project is to take the next step in working with data to try extracting some kind of information from patterns in the data. You will also make more use of functions and parameters to encapsulate code and build up a useful library of routines.


If you haven't already set yourself up for working on the project, then do so now.

  1. Mount your directory on the Personal server.
  2. Open the Terminal and navigate to your project2 directory on the Personal server.
  3. Open TextWrangler. If you want to look at any of the files you have already created, then open those files.

Tasks

  1. Download the following CSV file, which contains weather data from the Maine Lakes Resource Center at five-minute intervals from March 14th through July 19th, 2016. If you open the file, you'll note that it has a lot of header lines giving details about the data itself.

    Maine Lakes Resource Data 19

    This data contains 14 fields. For now, we're going to concern ourselves with field 8, PAR. PAR stands for Photosynthetically Active Radiation. Basically, it's a measure of how much visible sunlight is shining on the sensor.

    The goal of step one in this project is to figure out how many cloudy days and how many sunny days occurred in May, June, and July of 2016 (Note, the weather station stopped collecting data on July 19th at 8:30am, so that will be the last entry in the data). We're going to identify cloudy days by looking at the PAR value at noon on each day of May, June, and July and seeing how many of those PAR measurements were above 800uE and how many were below.

    • The first sub-task is to create a grep command that gives all of the lines of data from May, June and July. Looking at the time and date field, it has the form

      digit(s)/digit(s)/16 hour(s):minute

      Note that the month and day fields and the hour field can each have either one or two digits. We want to find all of the date patterns that have a 5, 6, or 7 in the month field, any one or two digits in the day field, and that end in /16. Develop the appropriate grep command and test it on the data to make sure you are getting every line from May, June, and July 2016.

      The form of your command will look like this:

      grep <mayJuneJulyExpression> MLRC_19.csv

      where <mayJuneJulyExpression> will be replaced by the actual expression (and don't forget to put it in quotes).
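One way to sanity-check a candidate expression before putting it into grep is to try it on a few hand-written lines in Python. The sketch below is only an illustration: the pattern is one possible reading of the description above, the sample lines are made up rather than taken from MLRC_19.csv, and the helper name is_may_june_july is hypothetical.

```python
import re

# Assumed pattern: a 5, 6, or 7 in the month field, a one- or two-digit day,
# the year 16, and then the space that separates the date from the time.
MAY_JUNE_JULY = re.compile(r"[567]/[0-9]{1,2}/16 ")


def is_may_june_july(line):
    """Return True if the line contains a May, June, or July 2016 date."""
    return MAY_JUNE_JULY.search(line) is not None


if __name__ == "__main__":
    # Made-up lines for illustration; the real file has more fields.
    samples = [
        "1,3/14/16 0:05,2.0",
        "2,4/15/16 12:00,2.0",
        "3,5/1/16 12:00,3.1",
        "4,7/19/16 8:30,1.0",
    ]
    for line in samples:
        print(line, is_may_june_july(line))
```

Note that plain grep treats braces differently from Python: with most versions of grep, an expression using {1,2} needs either the -E option or backslashes before the braces.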

    • The second sub-task is to extract the PAR value for just those lines that occur at noon. This is broken into two sub-sub-tasks:
      • Find all of the lines containing the string 12:00. Pipe the output of the grep command from step one to a grep command that extracts all of the noon lines. Your command should have the form:

        grep <mayJuneJulyExpression> MLRC_19.csv | grep <noonExpression>

        where <noonExpression> will be replaced by the expression to extract only those lines with 12:00 in them (don't forget your quotes). Test this step.

      • Pipe the output to a cut command that extracts field 8 delimited by commas. Test that this result gives you just the PAR column. It should give you 79 PAR values. Pipe the output of the cut command to wc if you want to get a line/word/character count (wc is a really useful tool for counting lines/words/characters in a file). Your terminal command for doing that should look something like below.

        grep <mayJuneJulyExpression> MLRC_19.csv | grep <noonExpression> | cut <arguments> | wc
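To make the field numbering concrete, here is a rough Python picture of what cutting one comma-delimited field does. It is a sketch, not a replacement for cut: the helper name cut_field is hypothetical and the sample line is made up. The point to notice is that cut counts fields starting at 1, while Python lists start at index 0.

```python
def cut_field(line, field, delimiter=","):
    """Return one field of a delimited line, counting fields from 1 as cut does."""
    words = line.rstrip("\n").split(delimiter)
    return words[field - 1]


if __name__ == "__main__":
    # Made-up line with nine fields; field 8 is the letter h.
    print(cut_field("a,b,c,d,e,f,g,h,i\n", 8))
```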

    • The third sub-task is to write a Python program (sunny.py) that reads the stream of PAR data and classifies each value as either a cloudy day or a sunny day. Use 800uE as the threshold for sunny/cloudy. A value less than 800 means the day is pretty overcast. Your program should also sum all of the PAR values for sunny days and sum the PAR values for cloudy days and then compute and print the average for each, along with the total number of each type of day. There should be 58 sunny days and 21 cloudy days at noon in May, June, and July 2016 (through July 19).
      # Your Name
      # Name of the file
      # Purpose of the file
      # Example Unix command for running the file
      #  (This should include the grepping and
      #   cutting and piping.)
      import sys
      
      def main(stdin):
        # assign to nSun the value 0
        # assign to sumSun the value 0
        # assign to nCloud the value 0
        # assign to sumCloud the value 0
      
        # assign to buf the result of calling readline using the stdin parameter
      
        while buf.strip() != '':
          # assign to par the result of casting buf to a float
          # if the par value is greater than 800
            # increment nSun by 1
            # increment sumSun by the value of par
          # else
            # increment nCloud by 1
            # increment sumCloud by the value of par
      
          # assign to buf the result of calling readline using the stdin parameter
      
        # print out the number of sunny days and the average par of sunny days
        # print out the number of cloudy days and the average par of cloudy days
      
      if __name__ == "__main__":
        main(sys.stdin)
      

    Notice the comments at the beginning of the file. You should replace this text with the relevant information for this particular file.
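The classification and averaging logic can also be tried out on a small list of numbers before wiring it up to stdin. The following is a minimal sketch under a few assumptions: the function name summarize_par is hypothetical, it takes a list of floats instead of reading lines, and it returns 0.0 for an average when there are no days of that type. Your sunny.py should still follow the while-loop outline above.

```python
def summarize_par(values, threshold=800.0):
    """Count sunny and cloudy readings and average the PAR of each group.

    A reading strictly greater than the threshold counts as sunny,
    matching the outline above; everything else counts as cloudy.
    """
    n_sun = 0
    sum_sun = 0.0
    n_cloud = 0
    sum_cloud = 0.0

    for par in values:
        if par > threshold:
            n_sun += 1
            sum_sun += par
        else:
            n_cloud += 1
            sum_cloud += par

    # Guard against dividing by zero when one group is empty (an assumption;
    # the outline does not say what to do in that case).
    avg_sun = sum_sun / n_sun if n_sun > 0 else 0.0
    avg_cloud = sum_cloud / n_cloud if n_cloud > 0 else 0.0
    return n_sun, avg_sun, n_cloud, avg_cloud


if __name__ == "__main__":
    # Made-up PAR values for illustration.
    print(summarize_par([900.0, 1100.0, 500.0]))
```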

    Take a screenshot of the Terminal with the output. You should include this in your wiki page write-up as an image to demonstrate the correctness of your code. This is required image 1.

  2. The second main task is to extract the percentage difference between the temperature at 1m and the temperature at 7m from the live buoy data (that is the data you need to curl) for May of 2016. Use the LEA buoy data at the link below for this task.

    http://schupflab.labs.keyes.colby.edu/buoy/3100_iSIC.csv

    We can break this down into three sub-tasks. The first is to identify all the data lines from May, similar to the first sub-task above. The second is to cut out the date/time, 1m, and 7m fields (fields 1, 9, and 18). The third is to read in each line, extract the 2nd and 3rd fields, convert them to floats and calculate the percent difference.

    • Write a grep command that identifies all of the live buoy data values from May. You can test it by piping the output to less.
    • Pipe the grep command to a cut command that extracts the date/time (field 1), 1m temperature (field 9), and 7m temperature (field 18) fields. Test that you are getting a result with three fields and that the fields have the appropriate values (hint: you can open the .csv file in TextWrangler or Excel to determine which values you should expect).
    • Write a Python program (name it mixing.py) that reads lines from stdin. Use the split function to divide a line by commas. Then cast the second and third words to floats using the float function. These are the 1m and 7m temperatures. Compute the percent change C = (1m - 7m)/7m, then print out the date/time and percent change. Here are comments to guide you as you write the code:
      # Your Name
      # Name of the file
      # Purpose of the file
      # Example Unix command for running the file
      #  (This should include the grepping and
      #   cutting and piping.)
      import sys
      
      def main(stdin):
        # assign to buf the result of calling readline using the stdin parameter
      
        while buf.strip() != '':
          # assign to words the result of calling buf.split(',')
          # assign to temp1m the result of casting words[1] to a float
          # assign to temp7m the result of casting words[2] to a float
          # assign to change the expression (temp1m - temp7m)/temp7m
          # print out words[0] and change, separated by a comma
      
          # assign to buf the result of calling readline using the stdin parameter
      
      if __name__ == "__main__":
        main(sys.stdin)
      
    • When you run the program, redirect the output to the file mixing.csv. You can do this using the > symbol.

      <other commands> | python mixing.py > mixing.csv

      Be sure to include this command in the header comments for your code file.
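As a worked example of the formula C = (1m - 7m)/7m, the sketch below converts one already-cut line into a date/time and a change value. The function names and the sample line are made up for illustration, and the date format in the real buoy file may differ. Note that C is a fraction, so a value of 0.2 means the 1m temperature is 20% higher than the 7m temperature.

```python
def percent_change(temp1m, temp7m):
    """Return (temp1m - temp7m) / temp7m as a fraction, not multiplied by 100."""
    return (temp1m - temp7m) / temp7m


def mixing_line(line):
    """Turn 'datetime,temp1m,temp7m' into 'datetime,change'."""
    words = line.strip().split(",")
    temp1m = float(words[1])
    temp7m = float(words[2])
    return words[0] + "," + str(percent_change(temp1m, temp7m))


if __name__ == "__main__":
    # Made-up reading: 12.0 degrees at 1m and 10.0 degrees at 7m.
    print(mixing_line("5/1/16 0:00,12.0,10.0\n"))
```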

  3. Our goal for this project is to examine the surface conditions--wind speed, gust speed, and sunshine--when the difference in temperatures between the 1m and 7m measurements drops. The wind speed and gust speed are available as 5min measurements in the Maine Lakes Center file (that is the .csv file you downloaded and used in task 1), while the buoy data (the data you curl) is available at 15min measurements.

    For this task, we're going to modify the wind speed, gust speed, and PAR to find the 15min averages from the Maine Lakes Resource Center file.

    There are three steps to this process. The first is to extract all of the May data, the second is to cut out the four fields we want (date/time, wind speed, wind gusts, and PAR), and the third is to pass it to a Python program that averages three readings to generate and print out a single reading every 15min.

    • Generate a grep command that extracts all of the lines from May from the MLRC data. This will be identical to the line you used in the previous task. Test it again.
    • Generate a cut command that extracts the date/time (field 2), Wind Speed (field 4), Gust Speed (field 5), and PAR (field 8). Pipe the output of grep from the previous step to the cut command and test the result.
    • Write a Python program (name it energy.py) that reads through the data, extracts the values from each line, and remembers the last three wind, gust, and PAR values. If the line is on a 15 min interval, then print out the average of the last three wind, gust, and PAR values.
      # Your Name
      # Name of the file
      # Purpose of the file
      # Example Unix command for running the file
      #  (This should include the grepping and
      #   cutting and piping.)
      import sys
      
      def main(stdin):
      
        # assign to wind0 the value 0.0
        # assign to wind1 the value 0.0
        # assign to wind2 the value 0.0
      
        # assign to gust0 the value 0.0
        # assign to gust1 the value 0.0
        # assign to gust2 the value 0.0
      
        # assign to par0 the value 0.0
        # assign to par1 the value 0.0
        # assign to par2 the value 0.0
        
        # assign to datetime the empty string ''
      
        # assign to buf the result of calling readline using the stdin parameter
      
        while buf.strip() != '':
          
          # assign to wind2 the value in wind1
          # assign to wind1 the value in wind0
      
          # assign to gust2 the value in gust1
          # assign to gust1 the value in gust0
      
          # assign to par2 the value in par1
          # assign to par1 the value in par0
      
          # assign to words the result of calling split on the buf variable with a comma as argument
      
          # assign to datetime the value in words[0]
          # assign to wind0 the result of casting words[1] to a float
          # assign to gust0 the result of casting words[2] to a float
          # assign to par0 the result of casting words[3] to a float
      
          # if any of the strings ":00:", ":15:", ":30:", or ":45:" are in the datetime string
            # assign to avgwind the average of wind0, wind1, and wind2
            # assign to avggust the average of gust0, gust1, and gust2
            # assign to avgpar the average of par0, par1, and par2
      
            # print the datetime, average wind, average gust, and average PAR separated by commas
      
          # assign to buf the result of calling readline using the stdin parameter
      
      if __name__ == "__main__":
        main(sys.stdin)
      

      Remember to update the comments at the top of your code file to say what this file's purpose is and how to run it. Test the program and make sure it is printing out values only on 15min intervals. Redirect the output to a file called energy.csv.
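The "remember the last three values" idea can also be expressed with a fixed-length queue. The sketch below is an alternative to the nine separate variables in the outline, not the required structure: it uses collections.deque, the function name fifteen_minute_averages is hypothetical, and the sample date/time strings (with seconds, so that ":15:" can match) are an assumption about the file format. Unlike the outline, which starts every variable at 0.0, this version averages only the readings it has actually seen, so its first one or two outputs can differ.

```python
from collections import deque

QUARTER_HOUR_TAGS = (":00:", ":15:", ":30:", ":45:")


def fifteen_minute_averages(lines):
    """Return (datetime, avg wind, avg gust, avg PAR) for each 15min line."""
    recent = deque(maxlen=3)  # holds at most the last three readings
    results = []

    for line in lines:
        words = line.strip().split(",")
        datetime = words[0]
        recent.append((float(words[1]), float(words[2]), float(words[3])))

        if any(tag in datetime for tag in QUARTER_HOUR_TAGS):
            count = len(recent)
            avgwind = sum(r[0] for r in recent) / count
            avggust = sum(r[1] for r in recent) / count
            avgpar = sum(r[2] for r in recent) / count
            results.append((datetime, avgwind, avggust, avgpar))

    return results


if __name__ == "__main__":
    # Made-up readings at five-minute spacing.
    sample = [
        "5/1/16 0:05:00,2.0,3.0,4.0",
        "5/1/16 0:10:00,3.0,4.0,5.0",
        "5/1/16 0:15:00,4.0,5.0,6.0",
    ]
    for row in fifteen_minute_averages(sample):
        print(",".join(str(value) for value in row))
```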

  4. After the last two tasks, you should have two files called mixing.csv and energy.csv. They should have the same number of lines (use wc to check this - they should each have 2976 lines).

    Use the Unix tool paste to combine the two files together, using a comma as the delimiter. Redirect the output to the file blend.csv. Check that the two date/time fields line up properly in the output file.
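If it helps to picture what paste produces, the sketch below joins corresponding lines of two inputs with a comma. It is only an illustration of the idea: the function name paste_lines and the sample lines are made up, and unlike the real paste tool it stops at the end of the shorter input. For this task you should still use paste itself.

```python
def paste_lines(left_lines, right_lines, delimiter=","):
    """Join line i of the left input to line i of the right input."""
    combined = []
    for left, right in zip(left_lines, right_lines):
        combined.append(left.rstrip("\n") + delimiter + right.rstrip("\n"))
    return combined


if __name__ == "__main__":
    # Made-up lines in the shape of mixing.csv and energy.csv.
    mixing = ["5/1/16 0:00,0.2\n"]
    energy = ["5/1/16 0:00:00,1.5,2.5,300.0\n"]
    for line in paste_lines(mixing, energy):
        print(line)
```

With mixing.csv given first, each combined line would hold six fields: date/time, percent change, date/time, wind, gust, and PAR.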

  5. The final task is to identify strong mixing events--times when the percent change in temperature is dropping quickly--and then look at the corresponding surface conditions.

    The blend file contains all of the necessary data for all of the dates of interest, so we don't need to use grep or cut. Instead, we just need to cat the file and pipe it to a Python program (name it find_events.py) that does the analysis.

    The Python program needs to remember the last hour's worth of mixing, wind, gust, and PAR values. That means you will need variables to hold each value. For example, you will want to initialize mix0, mix1, mix2, and mix3 to zero at the start of your main function.

    In the main loop, check to see if the percent change has dropped by 5% or more in the last hour by testing if mix3 - mix0 > 0.05. If it has changed, then print out the date/time, the mix value, and the sum of the last hour's worth of wind, gust, and PAR.

    Look at the output of your program and see if there are patterns that arise. What times of day and what conditions seem to occur when the temperature differences are decreasing quickly? How many unique mixing events occurred in the two months? Answer these questions explicitly in your write-up. Take a screenshot of the Terminal with any output that helps you to answer them. This is required image 2.
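The event test mix3 - mix0 > 0.05 can be sketched on a handful of made-up lines as follows. Several things here are assumptions rather than part of the assignment: the function name find_events, the use of collections.deque in place of the mix0 through mix3 variables, and the field order (date/time, mix, date/time, wind, gust, PAR), which is what you would get if mixing.csv was given to paste first. This version also waits until it has four readings before testing, whereas variables initialized to zero would be compared from the first line.

```python
from collections import deque


def find_events(lines, drop=0.05):
    """Return (datetime, mix, wind sum, gust sum, PAR sum) for each event."""
    recent = deque(maxlen=4)  # oldest reading at recent[0], newest at recent[-1]
    events = []

    for line in lines:
        words = line.strip().split(",")
        mix = float(words[1])
        wind = float(words[3])
        gust = float(words[4])
        par = float(words[5])
        recent.append((mix, wind, gust, par))

        # Oldest mix minus newest mix plays the role of mix3 - mix0.
        if len(recent) == 4 and recent[0][0] - recent[-1][0] > drop:
            windsum = sum(r[1] for r in recent)
            gustsum = sum(r[2] for r in recent)
            parsum = sum(r[3] for r in recent)
            events.append((words[0], mix, windsum, gustsum, parsum))

    return events


if __name__ == "__main__":
    # Made-up blend lines; "t1" etc. stand in for real date/time strings.
    sample = [
        "t1,0.30,t1,1.0,2.0,100.0",
        "t2,0.28,t2,1.0,2.0,100.0",
        "t3,0.26,t3,1.0,2.0,100.0",
        "t4,0.20,t4,1.0,2.0,100.0",
    ]
    for event in find_events(sample):
        print(",".join(str(value) for value in event))
```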


Extensions

Each assignment will have a set of suggested extensions. The required tasks and write-up constitute about 85% of the assignment, and if you do only the required tasks and do both well you will earn a B+. To earn a higher grade, you need to undertake one or more extensions. The difficulty and quality of the extension or extensions will determine your final grade for the assignment. One complex extension, done well, or 2-3 simple extensions are typical.


Write-up and Hand-in

Turn in your code by putting it into your private hand-in directory on the Courses server. All files should be organized in a folder titled project2 and you should include only those files necessary to run the program. We will grade all files turned in, so please do not turn in old, non-working, versions of files.

Make a new wiki page for your assignment. Put the label cs152s17project2 in the label field at the bottom of the page, and give the page a meaningful title (e.g. Milo's Project 2).

In general, your intended audience for your write-up is your peers in CS151 (i.e. students who know the same amount of Python but are not doing the same projects). Your goal should be to be able to use it to explain to friends what you accomplished in this project and to give them a sense of how you did it. Follow the outline below.