CS 152: Project 2

Spring 2020

Project 2: Extracting Information

The goal of this project is to extract some information from two large data files using a mixture of unix tools and Python. The data comes from two sources: the Goldie buoy located on Great Pond, and the weather center housed at the Maine Lakes Resource Center, also on Great Pond. Both capture information at 15-minute intervals.

Data Files

Download the following data files.


  1. Extract data from the month of July from both files using grep
  2. Extract specific columns from the July data from both files using cut
  3. Combine the files together using paste
  4. Calculate the high and low temperature of both the air and the water at 3m depth for the month of July.
  5. Calculate the number of sunny or cloudy days and the average amount of sunlight for each category.
  6. Calculate one other statistic of your choice.
  7. Write a CSV file with the day of the month and air temperature at 3pm extracted from the full data.
  8. Pick two other daily values/statistics, extract them, and write them to a CSV file.
  9. Make scatter plots from the data, showing daily values for your statistics.

Set yourself up for working on the project

  1. Mount your directory on the Personal server.
  2. Open the Terminal and navigate to your project2 directory on the Personal server.
  3. If you have not done so, download the CSV files to your project2 directory.
  4. Open TextWrangler. If you want to look at any of the files you have already created, then open those files.


  1. Extract the data from July

    Use grep to extract the header line plus all lines of the file with measurements taken in July. You will have to process each file separately. Direct the output of the two grep commands to the files Goldie2019July.csv and MLRC2019July.csv, respectively.

    If you look at either file, which you can do by using

    more Goldie2019.csv

    the first field is a date. Type "q" to get out of the more viewer.

    All of the July dates start with the pattern 07. You can use the ^ symbol to specify that the pattern that follows it must match at the start of a line.

    To direct the output of a grep command to a new file, use the > symbol followed by the name of the file to create.

    If you want to keep a header line on the Goldie data, you can add a second pattern to grep by using the -e flag, which lets you look for one pattern or another pattern. The following command extracts only the header line, which names each column of data.

    grep -e Date Goldie2019.csv

    To extract two patterns, use a second -e followed by the pattern. The following will extract both the header line and all of the lines containing 07/01.

    grep -e Date -e 07/01 Goldie2019.csv

    Double-check that each of the extracted files has the same number of lines. Use the unix command wc to check. The command wc returns the number of lines, words, and characters in a file.

    wc Goldie2019July.csv 
        2977    8953  530271 Goldie2019July.csv
    wc MLRC2019July.csv 
        2977    7792  241043 MLRC2019July.csv

    Including the header line, you should have 2977 lines.
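
    To see what the grep filter is doing, the same selection can be sketched in Python. This is only an illustration, not a replacement for the grep command: the function name extract_month and the sample rows are made up, and the date format is assumed to be MM/DD/YYYY at the start of each line.

```python
def extract_month(lines, month="07", header_word="Date"):
    """Keep the header line plus every line whose date starts with `month`."""
    kept = []
    for line in lines:
        # line.startswith(month) plays the role of the grep pattern ^07;
        # header_word in line plays the role of the extra pattern -e Date.
        if line.startswith(month) or header_word in line:
            kept.append(line)
    return kept


# Made-up sample rows for illustration; they are not real buoy data.
sample = [
    "Date/Time,Temp\n",
    "06/30/2019 11:48:00 PM,20.1\n",
    "07/01/2019 12:03:00 AM,20.0\n",
    "07/01/2019 12:18:00 AM,19.9\n",
    "08/01/2019 12:03:00 AM,21.5\n",
]
july = extract_month(sample)
print(len(july))  # prints 3: the header plus two July rows
```

    The length of the returned list corresponds to the first number that wc reports for the extracted file.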

  2. Extract useful columns

    Use the cut command to extract the columns Date/Time, 3m temp, 5m temp, 7m temp, and surface PAR from the Goldie2019July.csv file. These columns correspond to (1) Date/Time, (12) 3m temp, (17) 5m temp, (18) 7m temp, and (22) surface PAR. Direct the output to the file Goldie2019JulyCut.csv.

    Then use cut to extract the columns for air temp, wind speed, and wind direction from the MLRC2019July.csv file, directing the output to the file MLRCJulyCut.csv. These columns correspond to (3) air temp, (6) wind speed, and (8) wind direction.

    The cut command takes two options and the name of the file to process. The first option is -d, followed by the character to use to separate fields. The second option is -f, followed by a comma separated list of fields to extract. The following would use a comma to separate the fields and extract columns 2, 4, and 6 from the file mydata.csv.

    cut -d "," -f 2,4,6 mydata.csv

    To direct the output to a file, use the > symbol followed by the file to receive the data.

    cut -d "," -f 2,4,6 mydata.csv > newfile.csv

    Look at the headers in the resulting data files to make sure your cut command grabbed the right columns.
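
    The field numbering used by cut can be illustrated with a short Python sketch. The function name cut_fields is hypothetical, and the sketch handles only the simple case of a single-character delimiter and a list of individual field numbers.

```python
def cut_fields(line, fields, delimiter=","):
    """Return the chosen 1-based fields of one line, in file order."""
    parts = line.rstrip("\n").split(delimiter)
    # cut numbers fields starting at 1, while Python lists start at 0.
    # cut also prints fields in file order, so the numbers are sorted first.
    wanted = sorted(set(fields))
    return delimiter.join(parts[f - 1] for f in wanted if f <= len(parts))


print(cut_fields("a,b,c,d,e,f\n", [2, 4, 6]))  # prints b,d,f
```

    Note the off-by-one difference: field 2 for cut is index 1 in a Python list produced by split.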

  3. Paste the extracted columns together

    Use the paste command to combine the two cut files together, using a comma to separate the fields. Direct the output to the file GoldieMLRCJuly.csv.

    The paste command lets you specify what character to use to keep the fields separated when two files are combined. By default it is a tab. To tell paste to use a comma, use the option -d "," followed by the names of the two files to paste.

    The following combines the two files one.csv and two.csv using a comma delimiter and directs the output to the file joined.csv.

    paste -d "," one.csv two.csv > joined.csv

    Double-check that the merged file contains all of the columns from both files.
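
    What paste does line by line can be sketched in Python as follows. The function name paste_lines is hypothetical. One difference is worth noting: zip stops at the end of the shorter input, whereas paste keeps going and fills in empty fields, which is one reason to confirm that both files have the same number of lines first.

```python
def paste_lines(left_lines, right_lines, delimiter=","):
    """Join corresponding lines of two files with a delimiter between them."""
    joined = []
    for left, right in zip(left_lines, right_lines):
        # Drop each line's newline, glue the two halves together,
        # and put a single newline back at the end.
        joined.append(left.rstrip("\n") + delimiter + right.rstrip("\n") + "\n")
    return joined


one = ["Date,Temp\n", "07/01,20.0\n"]
two = ["Wind\n", "3.5\n"]
for line in paste_lines(one, two):
    print(line, end="")
```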

    Use the GoldieMLRCJuly.csv file for all of the following tasks.

  4. Calculate some max and min statistics

    Create a new file temps.py. Using the file hightemp.py as a template, have your code compute the high and low temperature of the air and the temperature of the water at a 3m depth.

    Your code should nicely print out the high and low air temperature and the high and low water temperature. Include this information in your report.
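
    The general pattern of tracking a running high and low inside a loop might look like the sketch below. It is not the contents of hightemp.py; the function name high_low, the column indices, and the sample rows are assumptions made for illustration.

```python
def high_low(lines, column):
    """Return (highest, lowest) value found in one column of CSV lines."""
    high = None
    low = None
    for line in lines[1:]:  # skip the header row
        words = line.strip().split(",")
        try:
            value = float(words[column])
        except (ValueError, IndexError):
            continue  # skip blank or malformed fields
        if high is None or value > high:
            high = value
        if low is None or value < low:
            low = value
    return high, low


# Made-up rows: date/time, water temperature, air temperature.
sample = [
    "Date/Time,Water,Air\n",
    "07/01/2019 12:03:00 AM,20.5,18.0\n",
    "07/01/2019 12:18:00 AM,21.0,25.5\n",
    "07/01/2019 12:33:00 AM,19.5,22.0\n",
    "07/01/2019 12:48:00 AM,,21.0\n",
]
print("Water high/low:", high_low(sample, 1))
print("Air high/low:", high_low(sample, 2))
```

    Starting high and low at None avoids having to guess an initial value that is lower or higher than every reading.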

  5. Calculate the number of sunny and cloudy days

    Create a new file sunny.py. As before, you can use hightemp.py as a template. Use the value of the PAR column measured at "12:03:00 PM" to determine if a day is sunny or cloudy. If the PAR value at that time is greater than 800, the day is sunny. Otherwise, it is cloudy.

    Have your code print out the number of sunny days and the average PAR value of the sunny days. Then it should print out the number of cloudy days and the average PAR value of the cloudy days.

    1. Create variables to hold the number of sunny days, the number of cloudy days, the sum of the sunny day PAR values, and the sum of the cloudy day PAR values. Initialize them all to have the value 0.
    2. Inside your loop, you can use the expression A in B to check if the substring A matches some part or all of B. For example, the following code will print 'Yes'.
      smallthing = "hel"
      bigthing = "hello world"
      if smallthing in bigthing:
          print('Yes')

      Use this functionality to check if the string "12:03:00 PM" is in the date/time field of the line. The date/time field is the first item in the words list: words[0].

    3. If the line is at the proper time, then start a new block of code.

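      1. Assign to a variable the result of casting the PAR field to a float. With the columns extracted above, PAR is the fifth field of each line, words[4]; check the header row of GoldieMLRCJuly.csv to confirm the index.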
      2. If the PAR value is greater than 800, add one to the number of sunny days and add the PAR value to the sum for sunny days.
      3. Else add one to the number of cloudy days and add the PAR value to the sum for cloudy days.

    4. At the end of your function, calculate the average PAR value for sunny and cloudy days by dividing their respective sums by their respective counts. Print the values nicely.

    The sum of the number of sunny and cloudy days should be 31. Include the data in your report.
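
    The counting steps above can be sketched roughly as follows. The function name count_sunny_cloudy and the sample rows are made up, and the PAR column index is passed in as a parameter because it depends on how your combined file is laid out.

```python
def count_sunny_cloudy(lines, par_index, threshold=800.0, time_text="12:03:00 PM"):
    """Return (sunny days, sunny average PAR, cloudy days, cloudy average PAR)."""
    sunny_days = 0
    cloudy_days = 0
    sunny_sum = 0.0
    cloudy_sum = 0.0
    for line in lines[1:]:  # skip the header row
        words = line.strip().split(",")
        if time_text in words[0]:
            par = float(words[par_index])
            if par > threshold:
                sunny_days += 1
                sunny_sum += par
            else:
                cloudy_days += 1
                cloudy_sum += par
    # Guard against dividing by zero if one category has no days.
    sunny_avg = sunny_sum / sunny_days if sunny_days > 0 else 0.0
    cloudy_avg = cloudy_sum / cloudy_days if cloudy_days > 0 else 0.0
    return sunny_days, sunny_avg, cloudy_days, cloudy_avg


# Made-up rows: date/time and PAR only, so PAR is at index 1 here.
sample = [
    "Date/Time,PAR\n",
    "07/01/2019 11:48:00 AM,900.0\n",
    "07/01/2019 12:03:00 PM,1200.0\n",
    "07/02/2019 12:03:00 PM,400.0\n",
    "07/03/2019 12:03:00 PM,1000.0\n",
]
print(count_sunny_cloudy(sample, 1))  # prints (2, 1100.0, 1, 400.0)
```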

  6. Compute a statistic of your choice

    Pick another statistic you find interesting (e.g. min, max, average, standard deviation, percent change). Then write a program that calculates its value and prints the result. In your report, describe what statistic you calculated, how you did it, and what the values were.
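
    As one possible example, the average and standard deviation of a list of values can be computed as in the sketch below. The function name mean_and_std and the numbers are illustrative only, and the sketch uses the population form of the standard deviation (dividing by n rather than n - 1).

```python
import math


def mean_and_std(values):
    """Return the mean and population standard deviation of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Average the squared distances from the mean, then take the square root.
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)


readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
mean, std = mean_and_std(readings)
print(mean, std)  # prints 5.0 2.0
```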

  7. Write a CSV file with extracted data and plot it

    Create a new file extract.py. You can use one of your prior files as a template.

    Write a function to extract the air temperature at 3pm (3:03:00 PM) on each day of July. Write to a CSV file the day of the month (1 to 31) and the corresponding 3pm air temperature. Each row of the resulting file should have two values separated by a comma.

    When you write the CSV file, make sure to include a header row at the top of the file with appropriate names for each column.
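
    A minimal sketch of extracting one value per day and writing it out with a header row is shown below. The function names, the column index, the header names, and the output file name are all assumptions, and the dates are assumed to look like MM/DD/YYYY. The sketch writes into a temporary folder so that it leaves no files behind; your own program would write directly into your project directory.

```python
import os
import tempfile


def extract_at_time(lines, value_index, time_text="3:03:00 PM"):
    """Return (day of month, value) pairs for lines measured at time_text."""
    rows = []
    for line in lines[1:]:  # skip the header row
        words = line.strip().split(",")
        if time_text in words[0]:
            # Assumes dates look like MM/DD/YYYY, so the day is the second piece.
            day = int(words[0].split("/")[1])
            rows.append((day, float(words[value_index])))
    return rows


def write_rows(filename, header, rows):
    """Write a header row, then one comma-separated row per pair."""
    with open(filename, "w") as fp:
        fp.write(",".join(header) + "\n")
        for day, value in rows:
            fp.write(str(day) + "," + str(value) + "\n")


# Made-up rows: date/time and air temperature only.
sample = [
    "Date/Time,Air Temp\n",
    "07/01/2019 2:48:00 PM,24.0\n",
    "07/01/2019 3:03:00 PM,24.5\n",
    "07/02/2019 3:03:00 PM,26.0\n",
]
rows = extract_at_time(sample, 1)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "airtemp3pm.csv")
    write_rows(path, ["day", "air temp"], rows)
    with open(path) as fp:
        written = fp.read()
print(written)
```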

    Use your favorite plotting tool to generate a plot of day (x-axis) versus 3pm air temperature (y-axis). Include the plot in your report.

  8. Extract two other variables to plot

    Create a new file extract2.py.

    Choose two variables and a time schedule for extracting them. For example, you might pick the water temperature at 5m depth at 2am, 8am, 2pm, and 8pm each day. It is up to you how many data points to grab, but you cannot grab all of them. Pick variables that might be interesting to contrast. Use the same time schedule for both variables.

    Write the code to extract the values you chose and write them to a CSV file. Each row of the file should have three values separated by commas. Make sure the file has a header row.

    Use your favorite plotting tool to make a plot with time (x-axis) and value (y-axis). Measure time in days, so that one full day has the value 1. So if you extract data at midnight, 6am, noon, and 6pm, use the time values N, N.25, N.5, and N.75, where N is the day of the month.
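
    The conversion from a day and a clock time to a fractional day value can be sketched as follows. The function names to_24_hour and day_fraction are hypothetical.

```python
def to_24_hour(hour12, am_pm):
    """Convert a 12-hour clock hour plus 'AM' or 'PM' to a 0-23 hour."""
    # 12 AM is hour 0 and 12 PM is hour 12, so reduce 12 to 0 first.
    hour = hour12 % 12
    if am_pm == "PM":
        hour += 12
    return hour


def day_fraction(day, hour, minute=0):
    """Return the day of the month plus the fraction of the day elapsed."""
    return day + (hour * 60 + minute) / (24 * 60)


# Midnight, 6am, noon, and 6pm on day 5 of the month.
for hour12, am_pm in [(12, "AM"), (6, "AM"), (12, "PM"), (6, "PM")]:
    print(day_fraction(5, to_24_hour(hour12, am_pm)))
# prints 5.0, 5.25, 5.5, and 5.75, one per line
```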

    Include your plots in your report along with a description of what values you extracted and on what time schedule.

Follow-up Questions

  1. What does it mean to create and write to a file?
  2. What is a loop, and why are loops useful?
  3. Why should you learn unix command line tools?
  4. Who is your favorite scientist?


Each assignment will have a set of suggested extensions. The required tasks constitute about 85% of the assignment, and if you do only the required tasks and do them well you will earn a B+. To earn a higher grade, you need to undertake one or more extensions. The difficulty and quality of the extension or extensions will determine your final grade for the assignment. One complex extension, done well, or 2-3 simple extensions are typical.

Submit your code

Turn in your code (all files ending with .py) by putting it in a directory on the Courses server. On the Courses server, you should have access to a directory called CS152, and within that, a directory with your user name. Within this directory is a directory named private. You and the professor can read, write, and edit files in that private directory, but no one else can. To hand in your code and other materials, create a new directory, such as project2, and then copy your code into it. Please submit only code that you want to be graded.

When submitting your code, double check the following.

  1. Is your name at the top of each code file?
  2. Does every function have a comment or docstring specifying what it does?
  3. Is your handin project directory inside your Private folder on Courses?

Write your project report

For CS 152 please use Google Docs to write your report. Create a new doc for each project. Start the doc with a title and your name. Attach the doc to your project on Google classroom. Make sure you click submit when you are done. The graders cannot provide feedback unless you click submit.

The intended audience for your report is your peers who are not in the class. From week to week you can assume your audience has read your prior reports. Your goal should be to be able to use the report to explain to friends what you accomplished in this project and to give them a sense of how you did it.

Your project report should contain the following elements. Please include a header for each section.