Project 2: Extracting Information
The goal of this project is to extract some information from two large data files using a mixture of unix tools and Python. The data comes from two sources: the Goldie buoy located on Great Pond, and the weather center housed at the Maine Lakes Resource Center, also on Great Pond. Both capture information at 15-minute intervals.
- Extract data from the month of July from both files using grep
- Extract specific columns from the July data from both files using cut
- Combine the files together using paste
- Calculate the high and low temperature of both the air and the water at a 3m depth for the month of July.
- Calculate the number of sunny or cloudy days and the average amount of sunlight for each category.
- Calculate two other statistics of your choice.
- Write a CSV file with the day of the month and air temperature at 3pm extracted from the full data.
- Pick two other daily values/statistics, extract them, and write them to a CSV file.
- Make scatter plots from the data, showing daily values for your statistics.
Set yourself up for working on the project
- Mount your directory on the Personal server.
- Open the Terminal and navigate to your project2 directory on the Personal server.
- If you have not done so, download the CSV files to your project2 directory.
- Open TextWrangler. If you want to look at any of the files you have already created, then open those files.
- Extract the data from July
Use grep to extract the lines of the file with measurements taken in July. You will have to run it on each file separately. Direct the output of the two grep commands to the files GoldieJuly.csv and MLRCJuly.csv, respectively.
If you look at either file, which you can do by using the more command (for example, more Goldie2019.csv), the first field is a date. Type "q" to get out of the more viewer.
All of the July dates start with the pattern 07. You can use the ^ symbol to specify that the ensuing pattern has to start the line of the file.
To direct the output of a grep command to a new file, use the > symbol followed by the name of the file to create.
If you want to keep a header line on the Goldie data, you can add a second pattern to grep by using the -e flag, which lets you look for one pattern or another pattern. The following command will extract only the header line, which gives the meaning of each column of data.
grep -e Date Goldie2019.csv
To extract two patterns, use a second -e followed by the pattern. The following will extract both the header line and all of the lines containing 07/01.
grep -e Date -e 07/01 Goldie2019.csv
Double-check that each of the extracted files has the same number of lines. Use the unix command wc to check. The command wc returns the number of lines, words, and characters in a file.
wc MLRCJuly.csv
2976 7776 240840
If you extract the headers, you should have one more line (2977).
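If the effect of the ^ anchor or of a second -e pattern is unclear, the following Python sketch imitates it with the re module. It is only an illustration, not a replacement for grep: the function name grep_lines and the sample lines are invented here, and the exact date/time format of the real files is an assumption.

```python
import re

# Invented sample lines; the real files' exact format is an assumption.
sample_lines = [
    "Date/Time,Value",
    "06/30/2019 11:48:00 PM,21.5",
    "07/01/2019 12:03:00 AM,21.4",
    "08/07/2019 12:03:00 AM,23.0",
]


def grep_lines(patterns, lines):
    """Return the lines matching at least one pattern, like grep -e P1 -e P2."""
    compiled = [re.compile(p) for p in patterns]
    return [line for line in lines if any(c.search(line) for c in compiled)]


# Without the anchor, "07" also matches the August 7th line.
unanchored = grep_lines(["07"], sample_lines)

# With the anchor, only lines that start with 07 match.
anchored = grep_lines(["^07"], sample_lines)

# Two patterns keep the header line as well as the July lines.
with_header = grep_lines(["Date", "^07"], sample_lines)

print(unanchored)
print(anchored)
print(with_header)
```

The unanchored pattern picks up a line from August because 07 appears in its day field, which is why the anchor matters for this task.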
- Extract useful columns
Use the cut command to extract columns 1, 12, 17, 18, and 22 from the GoldieJuly.csv file. These columns correspond to (1) Date/Time, (12) 3m temp, (17) 5m temp, (18) 7m temp, and (22) surface PAR. Direct the output to the file GoldieJulyCut.csv. Then use cut to extract columns 3, 6, and 8 from the MLRCJuly.csv file, directing the output to the file MLRCJulyCut.csv. These columns correspond to (3) air temp, (6) wind speed, and (8) wind direction. You may want to add the headers back into your data files before merging.
The cut command takes two options and the name of the file to process. The first option is -d, followed by the character to use to separate fields. The second option is -f, followed by a comma separated list of fields to extract. The following would use a comma to separate the fields and extract columns 2, 4, and 6 from the file mydata.csv.
cut -d "," -f 2,4,6 mydata.csv
To direct the output to a file, use the > symbol followed by the file to receive the data.
cut -d "," -f 2,4,6 mydata.csv > newfile.csv
Double-check that the two output files contain the proper columns.
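To see what cut does to a single line, here is a minimal Python sketch of the same idea. The function name cut_fields and the sample row are made up for illustration; the sketch assumes plain comma-separated text with no quoted commas, which is also what cut assumes.

```python
def cut_fields(line, fields, delimiter=","):
    """Keep the 1-based fields listed in `fields`, like cut -d "," -f 2,4,6.

    As with cut, the kept fields come out in file order, no matter what
    order they are listed in.
    """
    parts = line.rstrip("\n").split(delimiter)
    wanted = sorted(set(fields))
    return delimiter.join(parts[i - 1] for i in wanted if i <= len(parts))


row = "a,b,c,d,e,f,g"  # invented sample row
print(cut_fields(row, [2, 4, 6]))  # prints b,d,f
```

Note that the field numbers start at 1, not 0, which is the opposite of Python list indexes.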
- Paste the extracted columns together
Use the paste command to combine the two cut files together, using a comma to separate the fields. Direct the output to the file GoldieMLRCJuly.csv.
The paste command lets you specify what character to use to keep the fields separated when two files are combined. By default it is a tab. To tell paste to use a comma, use the option -d "," followed by the names of the two files to paste.
The following combines the two files one.csv and two.csv using a comma delimiter and directs the output to the file joined.csv.
paste -d "," one.csv two.csv > joined.csv
Double-check that the merged file contains all of the columns.
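The line-by-line behavior of paste can be sketched in Python as follows. The function name paste_lines and the sample lines are invented for illustration. One difference to be aware of: this sketch stops at the end of the shorter input, while the real paste command keeps going and leaves the missing fields empty, which is one reason to check that both files have the same number of lines first.

```python
def paste_lines(left_lines, right_lines, delimiter=","):
    """Join corresponding lines of two inputs, like paste -d "," one two."""
    return [
        left.rstrip("\n") + delimiter + right.rstrip("\n")
        for left, right in zip(left_lines, right_lines)
    ]


left = ["a,b\n", "c,d\n"]   # invented sample lines
right = ["1\n", "2\n"]
joined = paste_lines(left, right)
print(joined)  # prints ['a,b,1', 'c,d,2']
```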
Use the GoldieMLRCJuly.csv file for all of the following tasks.
- Calculate some max and min statistics
Create a new file temps.py. Using the file hightemp.py as a template, have your code compute the high and low temperature of both the air and the water at a 3m depth.
Your code should nicely print out the high and low air temperature and the high and low water temperature. Include this information in your report.
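If you are unsure how to track a high and a low value in one pass, the sketch below shows one possible shape. It does not reproduce hightemp.py, whose contents are not shown here. The function name high_low, the column positions, and the sample lines are all assumptions, so adapt the indexes to your own merged file.

```python
def high_low(lines, column):
    """Return (high, low) for the 0-based column, skipping the header line."""
    high = None
    low = None
    for line in lines[1:]:  # assumes the first line is a header
        words = line.strip().split(",")
        value = float(words[column])
        if high is None or value > high:
            high = value
        if low is None or value < low:
            low = value
    return high, low


# Invented sample data: date/time, water temperature, air temperature.
sample = [
    "Date/Time,3m Temp,Air Temp",
    "07/01/2019 12:03:00 AM,21.4,18.2",
    "07/01/2019 12:18:00 AM,21.6,17.9",
    "07/01/2019 12:33:00 AM,21.1,19.5",
]

water_high, water_low = high_low(sample, 1)
air_high, air_low = high_low(sample, 2)
print("Water high:", water_high, "low:", water_low)
print("Air high:", air_high, "low:", air_low)
```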
- Calculate the number of sunny and cloudy days
Create a new file sunny.py. As before, you can use hightemp.py as a template. Use the value of the PAR column measured at "12:03:00 PM" to determine if a day is sunny or cloudy. If the PAR value at that time is greater than 800, the day is sunny. Otherwise, it is cloudy.
Have your code print out the number of sunny days and the average PAR value of the sunny days. Then it should print out the number of cloudy days and the average PAR value of the cloudy days.
- Create variables to hold the number of sunny days, the number of cloudy days, the sum of the sunny day PAR values, and the sum of the cloudy day PAR values. Initialize them all to have the value 0.
- Inside your loop, you can use the expression A in B to check if the substring A matches some part or all of B. For example, the following code will print 'Yes'.
smallthing = "hel"
bigthing = "hello world"
if smallthing in bigthing:
    print('Yes')
Use this functionality to check if the string "12:03:00 PM" is in the date/time field of the line. The date/time field is the first item in the words list: words[0].
- If the line is at the proper time, then start a new block of code.
- Assign to a variable the result of casting the PAR field of the words list to a float. If you kept the column order from the cut step, PAR is the fifth field of each line.
- If the PAR value is greater than 800, add one to the number of sunny days and add the PAR value to the sum for sunny days.
- Else add one to the number of cloudy days and add the PAR value to the sum for cloudy days.
- At the end of your function, calculate the average PAR value for sunny and cloudy days by dividing their respective sums by their respective counts. Print the values nicely.
The sum of the number of sunny and cloudy days should be 31. Include the data in your report.
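The steps above can be sketched as follows. This is one possible arrangement, not the required one: the function name count_sunny_cloudy, the par_index parameter, and the sample lines are invented, and the guard against dividing by zero is an addition that the steps do not ask for.

```python
def count_sunny_cloudy(lines, par_index, threshold=800.0):
    """Count sunny and cloudy days from the PAR reading at 12:03:00 PM."""
    sunny_days = 0
    cloudy_days = 0
    sunny_sum = 0.0
    cloudy_sum = 0.0
    for line in lines:
        words = line.strip().split(",")
        if "12:03:00 PM" in words[0]:
            par = float(words[par_index])
            if par > threshold:
                sunny_days += 1
                sunny_sum += par
            else:
                cloudy_days += 1
                cloudy_sum += par
    # Avoid dividing by zero if one category has no days.
    sunny_avg = sunny_sum / sunny_days if sunny_days > 0 else 0.0
    cloudy_avg = cloudy_sum / cloudy_days if cloudy_days > 0 else 0.0
    return sunny_days, sunny_avg, cloudy_days, cloudy_avg


# Invented sample lines with PAR in the third field (index 2).
sample = [
    "Date/Time,Temp,PAR",
    "07/01/2019 12:03:00 PM,20.1,1200.0",
    "07/01/2019 12:18:00 PM,20.2,1300.0",
    "07/02/2019 12:03:00 PM,19.8,400.0",
    "07/03/2019 12:03:00 PM,20.5,1000.0",
]

result = count_sunny_cloudy(sample, 2)
print("Sunny days:", result[0], "average PAR:", result[1])
print("Cloudy days:", result[2], "average PAR:", result[3])
```

In the sample, the 12:18 reading is ignored because only the line at 12:03:00 PM decides the category for each day.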
- Compute a statistic of your choice
Pick another statistic you find interesting (e.g. min, max, average, standard deviation, percent change). Then write a program that calculates its value and prints the result. In your report, describe what statistic you calculated, how you did it, and what the values were.
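As a reminder of how a few of these statistics are defined, here is a short sketch using Python's built-in statistics module. The numbers are invented and are not taken from the lake data, and percent_change is a helper name made up for this example.

```python
import statistics


def percent_change(old, new):
    """Percent change from old to new, relative to old."""
    return (new - old) / old * 100.0


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # invented numbers

average = statistics.mean(values)
# pstdev treats the values as the whole population, not a sample.
std_dev = statistics.pstdev(values)
change = percent_change(20.0, 25.0)

print("average:", average)
print("standard deviation:", std_dev)
print("percent change:", change)
```

If you treat your readings as a sample rather than the full population, statistics.stdev is the corresponding function; say in your report which one you chose.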
- Write a CSV file with extracted data and plot it
Create a new file extract.py. You can use one of your prior files as a template.
Write a function to extract the air temperature at 3pm (3:03:00 PM) on each day of July. Write to a CSV file the day of the month (1 to 31) and the corresponding 3pm air temperature. Each row of the resulting file should have two values separated by a comma.
When you write the CSV file, make sure to include a header row at the top of the file with appropriate names for each column.
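One way to pull out a value per day and write it with a header row is sketched below, using Python's csv module. The function names, the column names in the header, the output file name, and the sample lines are all invented for illustration. The sketch also assumes the date begins with MM/DD/, so that the day is the second piece when the field is split on "/".

```python
import csv
import os
import tempfile


def extract_3pm(lines, temp_index):
    """Return (day, temperature) pairs for lines taken at 3:03:00 PM."""
    rows = []
    for line in lines:
        words = line.strip().split(",")
        if "3:03:00 PM" in words[0]:
            day = int(words[0].split("/")[1])  # assumes MM/DD/... dates
            rows.append((day, float(words[temp_index])))
    return rows


def write_csv(path, header, rows):
    """Write a header row followed by the data rows."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


# Invented sample lines with air temperature in the second field (index 1).
sample = [
    "Date/Time,Air Temp",
    "07/01/2019 2:48:00 PM,24.1",
    "07/01/2019 3:03:00 PM,24.5",
    "07/02/2019 3:03:00 PM,26.0",
]

rows = extract_3pm(sample, 1)
# Written to a temporary directory here; use your project2 directory instead.
out_path = os.path.join(tempfile.mkdtemp(), "air3pm.csv")
write_csv(out_path, ["day", "air_temp"], rows)
print(rows)
```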
Use your favorite plotting tool to generate a plot of day (x-axis) versus 3pm air temperature (y-axis). Include the plot in your report.
- Extract two other variables to plot
Create a new file extract2.py
Choose two variables and a time schedule for extracting them. For example, you might pick the water temperature at 5m depth at 2am, 8am, 2pm, and 8pm each day. It is up to you how many data points to grab, but you cannot grab all of them. Pick variables that might be interesting to contrast. Use the same time schedule for both variables.
Write the code to extract the values you chose and write them to a CSV file. Each row of the file should have three values separated by commas. Make sure the file has a header row.
Use your favorite plotting tool to make a plot with time (x-axis) and value (y-axis). Measure time in units of days, so that one full day has the value 1. So if you extract data at midnight, 6am, noon, and 6pm, use the time values N, N.25, N.5, and N.75, where N is the day of the month.
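The time value for a reading is the day of the month plus the fraction of the day that has passed. A minimal sketch, assuming you have already converted the time to a 24-hour hour and a minute (the function name fractional_day is made up here):

```python
def fractional_day(day, hour, minute=0):
    """Day of the month plus the fraction of the day elapsed (24-hour clock)."""
    return day + (hour * 60 + minute) / (24 * 60)


# Midnight, 6am, noon, and 6pm on the 5th of the month.
times = [fractional_day(5, h) for h in (0, 6, 12, 18)]
print(times)  # prints [5.0, 5.25, 5.5, 5.75]
```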
Include your plots in your report along with a description of what values you extracted and on what time schedule.
Follow-up Questions
- What does it mean to create and write to a file?
- What is a loop and why are they useful?
- Why should you learn unix command line tools?
- Who is your favorite scientist?
Extensions
Each assignment will have a set of suggested extensions. The required tasks constitute about 85% of the assignment, and if you do only the required tasks and do them well you will earn a B+. To earn a higher grade, you need to undertake one or more extensions. The difficulty and quality of the extension or extensions will determine your final grade for the assignment. One complex extension, done well, or 2-3 simple extensions are typical.
- Do more exploration of the data. Calculate additional statistics, or create additional plots. See if you can find anything interesting about the data. Hint: there is a correlation between the direction of the wind and the amount of mixing in the lake. Mixing is when cooler water from below mixes with warmer water from above.
- Compare the statistics for July with the statistics for August. Make sure the data you pick is not missing or strange.
- Compute other properties of the data. For example, compare the wind gust values with the wind speed values and see what the variation is. What is the largest wind gust in the data set? What is the largest percentage difference between wind speed and wind gust? What day saw the most wind gust? What day saw the most sun?
- Instead of looking at PAR at just noon, try looking at multiple times of day and using the average. How different are your counts of sunny/cloudy days using different times or combinations of times?
- Compute an average value for a statistic for each day of the month using all of the values taken during that day.
- As part of your report, show that you can get the same results using different types of patterns in grep. Use your writeup to teach yourself or your peers something about basic regular expressions.
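For the extension that averages a statistic over each day, one approach is to keep a running sum and count per day in dictionaries. The sketch below is an illustration under assumptions: the function name daily_averages and the sample lines are invented, and the date is assumed to begin with MM/DD/.

```python
def daily_averages(lines, column):
    """Return {day: average of the 0-based column over that day's lines}."""
    sums = {}
    counts = {}
    for line in lines:
        words = line.strip().split(",")
        try:
            day = int(words[0].split("/")[1])  # assumes MM/DD/... dates
            value = float(words[column])
        except (IndexError, ValueError):
            continue  # skip the header and any malformed or missing values
        sums[day] = sums.get(day, 0.0) + value
        counts[day] = counts.get(day, 0) + 1
    return {day: sums[day] / counts[day] for day in sorted(sums)}


# Invented sample lines.
sample = [
    "Date/Time,Air Temp",
    "07/01/2019 12:03:00 AM,20.0",
    "07/01/2019 12:18:00 AM,22.0",
    "07/02/2019 12:03:00 AM,18.0",
]

averages = daily_averages(sample, 1)
print(averages)  # prints {1: 21.0, 2: 18.0}
```

Skipping lines that fail to convert is a design choice; if many lines are skipped, that is worth noting in your report because the data may be missing or strange.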
Submit your code
Turn in your code (all files ending with .py) by putting it in a directory on the Courses server. On the Courses server, you should have access to a directory called CS152, and within that, a directory with your user name. Within this directory is a directory named private. You and the professor can read, write, and edit files that you put into that private directory, but no one else can. To hand in your code and other materials, create a new directory, such as project2, and then copy your code into the project directory for that week. Please submit only code that you want to be graded.
When submitting your code, double check the following.
- Is your name at the top of each code file?
- Does every function have a comment or docstring specifying what it does?
- Is your handin project directory inside your Private folder on Courses?
Write your project report
For CS 152 please use Google Docs to write your report. Create a new doc for each project. Start the doc with a title and your name. Attach the doc to your project on Google classroom. Make sure you click submit when you are done. The graders cannot provide feedback unless you click submit.
Your intended audience for your report is your peers not in the class. From week to week you can assume your audience has read your prior reports. Your goal should be to be able to use it to explain to friends what you accomplished in this project and to give them a sense of how you did it.
Your project report should contain the following elements.
- An abstract: a brief summary of the project, in your own words. This should be no more than a few sentences. Give the reader context and identify the key purpose of the assignment.
Writing an effective abstract is an important skill. Consider the following questions while writing it.
- Does it describe the CS concepts of the project (e.g. writing well-organized and efficient code)?
- Does it describe the specific project application?
- Does it describe your solution or how it was developed (e.g. what code did you write)?
- Does it describe the results or outputs (e.g. did your code work as expected)?
- Is it concise?
- Are all of the terms well-defined?
- Does it read logically and in the proper order?
- A description of your solution to the tasks, including any text output or images you created (including the three required images mentioned above). This should be a description of the form and functionality of your final code. Note any unique computational solutions you developed or any insights you gained from your code's output. You may want to incorporate code snippets in your description to point out relevant features. Code snippets should be small segments of code--usually less than a whole function--that demonstrate a particular concept. If you find yourself including more than 5-10 lines of code, it's probably not a snippet.
- A description of any extensions you undertook, including text output or images demonstrating those extensions. If you added any modules, functions, or other design components, note their structure and the algorithms you used.
- The answers to any follow-up questions (there will be 3-4 for each project).
- A brief description (1-3 sentences) of what you learned. Think about the answer to this question in terms of the stated purpose of the project. What are some specific things you had to learn or discover in order to complete the project?
- A list of people you worked with, including TAs and professors. Include in that list anyone whose code you may have seen, such as those of friends who have taken the course in a previous semester.