CS 151 Project 2

Project 2: Extracting information


The purpose of this project is to take the next step in working with data: extracting information from patterns in the data. You will also make more use of functions and parameters to encapsulate code and build up a useful library of routines.


Tasks

  1. If you haven't already set yourself up for working on the project, then do so now.
    1. Mount your directory on the Personal server.
    2. Open the Terminal and navigate to your Project2 directory on the Personal server.
    3. Open TextWrangler. If you want to look at any of the files you have already created, then open those files.
  2. Download the following CSV file, which contains weather data from the Maine Lakes Resource Center at five-minute intervals since May. If you open the file, you'll note that it has a lot of header lines giving details about the data itself.

    Maine Lakes Resource Data 16

    This data contains 14 fields. For now, we're going to concern ourselves with field 8, PAR. PAR stands for Photosynthetically Active Radiation. Basically, it's a measure of how much visible sunlight is shining on the sensor.

    The goal of step one in this project is to figure out how many cloudy days and how many sunny days occurred in June and July of 2015. We're going to do this by looking at the PAR value at noon on each day of June and July and seeing how many of those PAR measurements were above 800uE and how many were below.

    • The first sub-task is to create a grep command that gives all of the lines of data from June and July. If you look at the date/time field, you'll see it has the form

      digit(s)/digit(s)/15 hour(s):minute

      Note that the month and day fields and the hour field can have either one or two digits in them. We want to find all of the date patterns that have a 6 or 7 in the month field, any one or two digits in the day field, and a year of /15. Develop the appropriate grep command and test it on the data to make sure you are getting every line from June and July 2015.

      The form of your command will look like this:

      grep <juneJulyExpression> MaineLakesResourceCenterData16.csv

      where <juneJulyExpression> will be replaced by the actual expression (and don't forget to put it in quotes). A worked example of the full pipeline appears at the end of this task.

    • The second sub-task is to extract the PAR value for just those lines that occur at noon. This is broken into two sub-sub-tasks:
      • Find all of the lines containing the string 12:00. Pipe the output of the grep command from the first sub-task to a grep command that extracts all of the noon lines. Your command should have the form:

        grep <juneJulyExpression> MaineLakesResourceCenterData16.csv | grep <noonExpression>

        where <noonExpression> will be replaced by the expression to extract only those lines with 12:00 in them (don't forget your quotes). Test this step.

      • Pipe the output to a cut command that extracts field 8, using a comma as the delimiter. Test that this result gives you just the PAR column; it should give you 61 PAR values. Pipe the output of the cut command to wc if you want to get a line/word/character count. Your terminal command for doing that should look something like the one below.

        grep <juneJulyExpression> MaineLakesResourceCenterData16.csv | grep <noonExpression> | cut <arguments> | wc

    • The third sub-task is to write a Python program that reads the stream of PAR data and classifies each value as either a cloudy day or a sunny day. Use 800uE as the threshold for sunny/cloudy; a value less than 800 means the day is pretty overcast. Your program should also sum the PAR values for sunny days and for cloudy days, then compute and print the average for each, along with the total number of each type of day. There should be 42 sunny days and 19 cloudy days at noon in June and July 2015.
      import sys
      
      def main(stdin):
        # assign to nSun the value 0
        # assign to sumSun the value 0
        # assign to nCloud the value 0
        # assign to sumCloud the value 0
      
        # assign to buf the result of calling readline using the stdin parameter
      
        while buf.strip() != '':
          # assign to par the result of casting buf to a float
          # if the par value is greater than 800
            # increment nSun by 1
            # increment sumSun by the value of par
          # else
            # increment nCloud by 1
            # increment sumCloud by the value of par
      
          # assign to buf the result of calling readline using the stdin parameter
      
        # print out the number of sunny days and the average par of sunny days
        # print out the number of cloudy days and the average par of cloudy days
      
      if __name__ == "__main__":
        main(sys.stdin)
      

    Take a screenshot of the Terminal with the output. You should include this in your wiki page write-up as an image to demonstrate the correctness of your code. This is required image 1.
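
    As a reference point, the full pipeline might look something like the line below. The June/July expression shown here is just one possibility (it assumes the date is the only slash-separated value on each line, so verify against the file that it returns only June and July lines), and sunny.py is simply a placeholder for whatever you name your program.

      grep '[67]/[0-9][0-9]*/15' MaineLakesResourceCenterData16.csv | grep '12:00' | cut -d ',' -f 8 | python sunny.py

    Swapping the last stage for wc -l is a quick way to confirm that you are getting the expected 61 noon values.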

  3. The second main task is to extract the percentage difference between the temperature at 1m and the temperature at 7m from the live buoy data (that is, the data you need to curl) for June and July of 2015.

    We can break this down into three sub-tasks. The first is to identify all the data lines from June and July, similar to the first sub-task above. The second is to cut out the date/time, 1m, and 7m fields. The third is to read in each line, extract the 2nd and 3rd fields, convert them to floats and calculate the percent difference.

    • Write a grep command that identifies all of the live buoy data values from June and July. You can test it by piping the output to less.
    • Pipe the grep command to a cut command that extracts the date/time (field 1), 1m temperature (field 11), and 7m temperature (field 18) fields. Test that you are getting a result with three fields and that the fields have the appropriate values (hint: you can open the .csv file in TextWrangler or Excel to determine which values you should expect).
    • Write a Python program (name it mixing.py) that reads lines from stdin. Use the split function to divide a line by commas. Then cast the second and third words to floats using the float function. These are the 1m and 7m temperatures. Compute the percent change C = (temp1m - temp7m)/temp7m, then print out the date/time and the percent change. Here are comments to guide you as you write the code:
      import sys
      
      def main(stdin):
        # assign to buf the result of calling readline using the stdin parameter
      
        while buf.strip() != '':
          # assign to words the result of calling buf.split(',')
          # assign to temp1m the result of casting words[1] to a float
          # assign to temp7m the result of casting words[2] to a float
          # assign to change the expression (temp1m - temp7m)/temp7m
          # print out words[0] and change, separated by a comma
      
          # assign to buf the result of calling readline using the stdin parameter
      
      if __name__ == "__main__":
        main(sys.stdin)
      
    • When you run the program, redirect the output to the file mixing.csv. You can do this using the > symbol.

      <other commands> | python mixing.py > mixing.csv
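
    Putting the three sub-tasks together, the whole command might look something like the line below, where <buoyDataFile> stands for whatever file holds your curled buoy data and <juneJulyExpression> is built the same way as in the previous task (check the buoy file's date format before reusing the exact same expression). The cut arguments assume the file is comma-delimited throughout.

      grep '<juneJulyExpression>' <buoyDataFile> | cut -d ',' -f 1,11,18 | python mixing.py > mixing.csv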

  4. Our goal for this project is to examine the surface conditions--wind speed, precipitation, and sunshine--when the difference in temperatures between the 1m and 7m measurements drops. The wind speed and precipitation are available as 5min measurements in the Maine Lakes Resource Center file (that is, the .csv file you downloaded and used in task 2), while the buoy data (the data you curl) is available at 15min intervals. For this task, we're going to average the wind speed, PAR, and precipitation readings from the Maine Lakes Resource Center file over 15min intervals so they line up with the buoy data.

    There are three steps to this process. The first is to extract all of the June and July data, the second is to cut out the four fields we want (date/time, wind speed, PAR, and precipitation), and the third is to pass the result to a Python program that averages three readings to generate and print out a single reading every 15min.

    • Generate a grep command that extracts all of the lines from June and July. This will be the same as the line you used in task 2. Test it again.
    • Generate a cut command that extracts the date/time (field 2), Wind Speed (field 4), PAR (field 8), and Rain (field 13). Pipe the output of grep from the previous step to the cut command and test the result.
    • Write a Python program (name it energy.py) that reads through the data, extracts the values from each line, and remembers the last three wind, rain, and PAR values. If the line falls on a 15min interval, then print out the average of the last three wind, rain, and PAR values.
      import sys
      
      def main(stdin):
      
        # assign to wind0 the value 0.0
        # assign to wind1 the value 0.0
        # assign to wind2 the value 0.0
      
        # assign to rain0 the value 0.0
        # assign to rain1 the value 0.0
        # assign to rain2 the value 0.0
      
        # assign to par0 the value 0.0
        # assign to par1 the value 0.0
        # assign to par2 the value 0.0
        
        # assign to datetime the empty string ''
      
        # assign to buf the result of calling readline using the stdin parameter
      
        while buf.strip() != '':
          
          # assign to wind2 the value in wind1
          # assign to wind1 the value in wind0
      
          # assign to rain2 the value in rain1
          # assign to rain1 the value in rain0
      
          # assign to par2 the value in par1
          # assign to par1 the value in par0
      
          # assign to words the result of calling split on the buf variable with a comma as argument
      
          # assign to datetime the value in words[0]
          # assign to wind0 the result of casting words[1] to a float
          # assign to par0 the result of casting words[2] to a float
          # assign to rain0 the result of casting words[3] to a float
      
          # if any of the strings ":00:", ":15:", ":30:", or ":45:" are in the datetime string
            # assign to avgwind the average of wind0, wind1, and wind2
            # assign to avgpar the average of par0, par1, and par2
            # assign to avgrain the average of rain0, rain1, and rain2
      
            # print the datetime, average wind, average PAR, and average rain, separated by commas
      
          # assign to buf the result of calling readline using the stdin parameter
      
      if __name__ == "__main__":
        main(sys.stdin)
      

      Remember to add the appropriate comments to the top of the file (including how to run the program). Test the program and make sure it is printing out values only on 15min intervals. Redirect the output to a file called energy.csv.
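
    The complete pipeline for this task might look something like the line below, where <juneJulyExpression> is the same expression you developed earlier and the cut arguments assume the file is comma-delimited throughout.

      grep '<juneJulyExpression>' MaineLakesResourceCenterData16.csv | cut -d ',' -f 2,4,8,13 | python energy.py > energy.csv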

  5. After the last two tasks, you should have two files, mixing.csv and energy.csv. They should have the same number of lines (use wc to check this).

    Use the Unix tool paste to combine the two files, using a comma as the delimiter. Redirect the output to the file blend.csv. Check that the two date/time fields line up properly in the output file.
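
    The line-count check and the blend command might look something like the lines below; paste's -d flag sets the delimiter placed between the pasted columns. Note that the order of the two files determines the field order in blend.csv, so keep track of it (the sketch in the next task assumes mixing.csv comes first).

      wc -l mixing.csv energy.csv
      paste -d ',' mixing.csv energy.csv > blend.csv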

  6. The final task is to identify strong mixing events--times when the percent change in temperature is dropping quickly--and then look at the corresponding surface conditions.

    The blend file contains all of the necessary data for all of the dates of interest, so we don't need to use grep or cut. Instead, we just need to cat the file and pipe it to a Python program (name it find_events.py) that does the analysis.

    The Python program needs to remember the last hour's worth of mixing, wind, rain, and PAR values. Since the data is on 15min intervals, that means four readings of each, and you will need variables to hold each value. For example, you will want to initialize mix0, mix1, mix2, and mix3 to zero at the start of your main function.

    In the main loop, check to see if the percent change has dropped by 5% or more in the last hour by testing if mix3 - mix0 > 0.05, where mix0 is the most recent value and mix3 is the value from an hour ago. If it has, then print out the date/time, the mix value, and the sums of the last hour's worth of wind, rain, and PAR values.
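
    As a starting point, here is a minimal sketch of one way to remember and shift the last hour's worth of readings for just one of the four quantities (the percent-change, or mix, value); wind, rain, and PAR follow the same pattern, and you still need to add the sums to the printout. The field position is an assumption: it takes the percent change to be the second comma-separated field of blend.csv, which is the case if mixing.csv was pasted before energy.csv.

      import sys

      def main(stdin):
        # four readings at 15min intervals cover the last hour
        mix0 = 0.0  # most recent reading
        mix1 = 0.0
        mix2 = 0.0
        mix3 = 0.0  # reading from an hour ago

        buf = stdin.readline()
        while buf.strip() != '':
          words = buf.split(',')

          # shift the older readings back one slot, then store the newest
          mix3 = mix2
          mix2 = mix1
          mix1 = mix0
          mix0 = float(words[1])

          # a drop of 5% or more over the last hour marks a mixing event
          if mix3 - mix0 > 0.05:
            print('%s,%.3f' % (words[0], mix0))

          buf = stdin.readline()

      if __name__ == "__main__":
        main(sys.stdin)

    Run it the same way as the earlier programs, e.g. cat blend.csv | python find_events.py.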

    Look at the output of your program and see if there are patterns that arise. What times of day and what conditions seem to occur when the temperature differences are decreasing quickly? How many unique mixing events occurred in the two months? Answer these questions explicitly in your write-up. Take a screenshot of the Terminal with any output that helps you to answer them. This is required image 2.


Extensions

Each assignment will have a set of suggested extensions. The required tasks constitute about 85% of the assignment, and if you do only the required tasks and do them well you will earn a B+. To earn a higher grade, you need to undertake one or more extensions. The difficulty and quality of the extension or extensions will determine your final grade for the assignment. One complex extension, done well, or 2-3 simple extensions are typical.


Write-up and Hand-in

Turn in your code by putting it into your private hand-in directory on the Courses server. All files should be organized in a folder titled "Project 2" and you should include only those files necessary to run the program. We will grade all files turned in, so please do not turn in old, non-working versions of files.

Make a new wiki page for your assignment. Put the label cs151sf15project2 in the label field at the bottom of the page, but give the page itself a meaningful title (e.g. Milo's Project 2).

In general, the intended audience for your write-up is your peers who are not in the class. Your goal is to explain to friends what you accomplished in this project and to give them a sense of how you did it. Follow the outline below.