Objectives

The goal of this week's project is to start converting data into knowledge. We'll answer questions like, how many sunny days were there in June and July at Great Pond? To do that, we have to start looking at the values of the data on each day and have the computer make decisions.

The purpose of this lab time is to give you some practice with the Unix tool grep as well as to examine how we can split a string into pieces using the Python string function split(). These two capabilities will be necessary for the project.


Tasks

Mounting the File Server

Mount the Personal fileserver, as in Lab 1. Pressing Cmd-k while in the Finder is a shortcut to the fileserver connection window.

smb://filer.colby.edu

Once mounted, create a project2 folder. Then open the Terminal and change your working directory to the project2 folder. You change directories by typing cd and then the path to the directory. You can either type the path to your project2 directory out by hand or you can use the Finder to drag and drop the path to your project2 folder into the Terminal. (You will still have to type cd and a space, first.)

grep

grep is a very useful tool for searching for patterns in data. So far, we have used it to search for a specific string. However, what if we want to search for more than one string, or strings that follow particular rules? By default, grep interprets its pattern as a basic regular expression; with the -E flag, it interprets the pattern as an extended regular expression. To learn more about the format of basic regular expressions, type man re_format in the Terminal. To learn more about grep, type man grep in the Terminal.

Regular expressions are a powerful method of describing patterns. We're going to look at two specific capabilities. Consider the following file.

date,month,days
1/13/2014,cold,31
2/10/2014,colder,28
3/10/2014,cold,31
4/14/2014,wet,30
5/12/2014,muddy,31
6/9/2014,wet,30
7/14/2014,hot,31
8/11/2014,hotter,31
9/8/2014,cool,30
10/13/2014,cooler,31
11/10/2014,chilly,30
12/8/2014,cold,31
1/13/2015,colder,31
2/10/2015,colder,28
3/10/2015,cold,31
4/14/2015,chilly,30
5/12/2015,wet,31
6/9/2015,warm,30
7/14/2015,hot,31
8/11/2015,warm,31
9/8/2015,warm,30
10/13/2015,cool,31
11/10/2015,cool,30
12/8/2015,cold,31

Download the dates.csv file so you can test out the grep command. (Right click on the link and use Save Link As... or Download Linked File As...)

What if we wanted to find the lines corresponding to August and September of 2014? The pattern we want is something like '#/#/2014', where the first number is either an 8 or a 9, and the second number is any one- or two-digit value. A regular expression allows us to specify single characters from a set of choices by using brackets. The expression [89] means that grep can match an 8 or a 9. Try the following pattern.

grep '[89]' dates.csv

You should get every line in the file that contains an 8 or 9. If you add a slash after the [89], then it will find lines that have an 8 or a 9 followed by a slash.

grep '[89]/' dates.csv

The second number could be any one- or two-digit value. Rather than having to enumerate all of the digits, we can use the regular expression [:digit:] to represent the set of all digits. Try the following pattern:

grep '[89]/[:digit:]' dates.csv

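This pattern fails to find anything, because [:digit:] is the name of a character class, and grep recognizes it only when it is inside another pair of brackets, just like we put the 89 in brackets. On its own, [:digit:] is read as an ordinary bracket expression containing the characters :, d, i, g, and t (some versions of grep report an error instead). Try the following: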

grep '[89]/[[:digit:]]' dates.csv

This pattern still doesn't do what we want because it is not constrained to matching an 8 or 9 in the first (month) field. It also matches dates in which there is an 8 or 9 in the second (day) field. Adding a second slash to the pattern should eliminate some of the lines we do not want (with an 8 or 9 in the day, but not the month). Try the following:

grep '[89]/[[:digit:]]/' dates.csv

This time, the problem is that the pattern is too strict. It grabs only the lines where there is a single digit between the slashes (i.e. only dates with days 1-9). However, there can be either one or two digits in the middle field. We can specify that there are one or more digits by using the special combination \+ after a symbol or bracket expression. So the expression '[[:digit:]]\+' specifies one or more digits. Therefore, we can extend our overall pattern to the following. Try it out.

grep '[89]/[[:digit:]]\+/' dates.csv

Note: this is the first attempt where you actually have to put quote marks around the expression. That is because this expression contains the backslash character \. The backslash character has a special meaning to the Terminal, so it does not pass the expression to grep unchanged. By putting quotes around the expression, we tell the Terminal to pass the expression to grep unchanged.

The final touch is to stick 2014 on the end. This expression should return all lines in dates.csv with dates that fall in August and September of 2014 (and only those lines):

grep '[89]/[[:digit:]]\+/2014' dates.csv

Verify that your output is

8/11/2014,hotter,31
9/8/2014,cool,30
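If you want to double-check the logic of the final pattern outside of grep, here is a rough sketch in Python. Python's re module does not understand POSIX classes such as [[:digit:]], so this version assumes [0-9] as a stand-in. The sample lines are copied from the file listing above, and the names lines, pattern, and matches are only illustrative.

```python
import re

# A few lines copied from the dates.csv listing above.
lines = [
    "6/9/2014,wet,30",
    "8/11/2014,hotter,31",
    "9/8/2014,cool,30",
    "12/8/2014,cold,31",
    "8/11/2015,warm,31",
]

# Assumed Python equivalent of the grep pattern '[89]/[[:digit:]]\+/2014'.
pattern = re.compile(r"[89]/[0-9]+/2014")

# Like grep, re.search looks for the pattern anywhere in the line.
matches = [line for line in lines if pattern.search(line)]
for line in matches:
    print(line)
```

Only the August and September 2014 lines survive: the lines with an 8 or 9 in the day field fail because the pattern requires a second slash before 2014.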

paste

Imagine you have two data files that each have the same number of rows, but different data. You want to merge these two data files together, combining the first row from file 1 with the first row from file 2, and so on. The Unix command paste is the tool you want.

Download the file temps.csv. It contains the same number of lines as the dates.csv file, but shows the high and low temperature (F) for the contiguous United States on the corresponding dates.

Try using paste on the two files with no flags and see what it does. Note that, by default, it puts a tab in between the lines from each file. If we want it to insert a comma instead, we need to tell it to do so using the -d flag (see man paste for details).

paste -d ',' dates.csv temps.csv

One other useful thing we can do with the Unix shell is redirect output from the Terminal to a file. The > symbol tells the Terminal to redirect anything heading towards stdout to the specified file, instead. For example, the following command sends the output of paste to the file blend.csv.

paste -d ',' dates.csv temps.csv > blend.csv

Verify that the contents of blend.csv make sense by opening the file in TextWrangler.
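For intuition, the row-by-row merge that paste performs can be sketched in Python. This is only an illustration of the idea, not how paste itself is implemented. The temperature values are invented, since the contents of temps.csv are not shown here, and the variable names are hypothetical.

```python
# Two small stand-ins for the files; the temperature values are invented.
date_rows = ["1/13/2014,cold,31", "2/10/2014,colder,28"]
temp_rows = ["30.5,12.1", "28.0,9.4"]

# zip pairs the first row with the first row, the second with the second,
# and so on; the comma plays the role of the -d ',' delimiter.
blended = [d + "," + t for d, t in zip(date_rows, temp_rows)]
for row in blended:
    print(row)
```

Each blended row has five comma-separated fields, which is why the high and low temperatures end up in fields 4 and 5.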

split()

Create a new file in TextWrangler. Save it as temps.py. Remember to start every new program with three comments at the top: your name, the date, and the course and project numbers. Copy the following template into your file. This template puts the body of your code inside a function named main(), and then calls the main() function only if the file was executed and not imported (more on this below).

# Your Name
# Today's date
# CS 152 Project 2
#
# Command to run the program:
# grep /2014 blend.csv | cut -f 4,5 -d ',' | python3 temps.py
#

import sys

def main(stdin):

    # main code here
    pass  # placeholder so the empty template runs; remove it once you add code


if __name__ == "__main__":
    main(sys.stdin)

The last two lines will be new this week. In all of our future coding, we will encapsulate all of our code in functions. The "top-level" or "master" function is often called main(), but it does not have to be. Encapsulating all code in functions makes it easier to import existing code files into other files. This makes it possible to use your code in another program without re-typing it (or copying and pasting). However, if we want to run a file, we want the main() function in that file to execute. The if statement in the template above distinguishes between a file that was executed from the Terminal (command-line) and one that was imported into another Python file. Its condition (__name__ == "__main__") evaluates as true only when the file is executed directly. So, if temps.py is imported, its main() function will not execute automatically; it will wait to be called by the program that imported it.
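Here is a minimal sketch of the same idea in isolation. The body of main() is a made-up stand-in, not the code you will write for this lab.

```python
def main():
    # Stand-in body for illustration only.
    return "main ran"

# __name__ is "__main__" when the file is executed directly, and the
# module's own name when the file is imported by another program.
if __name__ == "__main__":
    print(main())
```

If you save this in a file and run it from the Terminal, it prints the message. If another file imports it, nothing prints until that file calls main() itself.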

The goal of this task is to find the average high and low temperatures during 2014. We can simplify our task by first using grep to find all lines with the string /2014, then using cut to extract fields 4 and 5 (the high and low temperatures). However, that means that the high and low temperatures will be lumped together in each line (separated by a comma). Test the first two components of the command above on the blend.csv file and see if it gives you a stream of numbers in two columns.

In order to separate the two numbers in the stream, we need a way to split a string into pieces inside Python.

Start by creating the overall loop that reads a line from the stream until it receives an empty line. The following code should go inside the main() function:

# assign to buf the result of calling stdin.readline()
# while buf.strip() != '':
    # 
    # Your other code will go here
    #
    # assign to buf the result of calling stdin.readline()

We do not have to use sys.stdin.readline() because sys.stdin is passed in as the argument to main(). We do this so that another Python function could also call main() with its own data to process.
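As a sketch of why it helps to pass the stream in, the snippet below hands a function an in-memory stream instead of sys.stdin. The function count_lines and the sample text are hypothetical, made up for illustration; the loop has the same read-until-empty shape as the outline above.

```python
import io

def count_lines(stream):
    """Count the lines read from stream until a blank line or end of input."""
    count = 0
    buf = stream.readline()
    while buf.strip() != '':
        count += 1
        buf = stream.readline()
    return count

# io.StringIO wraps a string so that it can be read like sys.stdin.
fake_input = io.StringIO("61.0,40.2\n58.3,39.9\n")
n = count_lines(fake_input)
print(n)
```

At the end of the input, readline() returns an empty string, so the loop condition becomes false and the loop stops.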

Put a statement in your while loop that prints the value of buf to the Terminal. Then run your program to make sure that it prints out correctly.

Next, your while loop should store the result of the function call buf.split(',') in the variable words. Calling the split() function of the string buf with ',' as an argument divides the string into pieces, splitting it on the commas. After assigning the split() result to words, have your loop print words to the Terminal. Test it and see what it prints out.
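For reference, here is roughly what split() does to a single line. The line itself is invented; your values will come from the stream.

```python
# One line as it might arrive from the stream (values invented).
buf = "61.0,40.2\n"

words = buf.split(',')
print(words)  # note that the newline stays attached to the last piece
```

The pieces are still strings, not numbers, and the comma itself is not included in any piece.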

The variable words is an example of Python's list data type. Visually and syntactically, Python represents a list as square brackets containing comma-separated elements. Elements of a list are accessed using bracket notation: the first element of the list contained in words is words[0], the second element of the list is words[1], and so on. Note that Python uses what is called zero-indexing, which means that the first element of a list has the index 0.

In your loop, after the assignment to words, assign to hitemp the result of casting words[0] to a float. Then assign to lotemp the result of casting words[1] to a float. Comment out the other print statements in the loop and add a statement that prints hitemp and lotemp. Test your code and make sure it prints out two columns of floating point numbers.
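The indexing and casting steps look something like the following sketch, which assumes a hand-written list in place of a real split() result.

```python
words = ['61.0', '40.2\n']  # invented example of a split() result

hitemp = float(words[0])    # first element, index 0
lotemp = float(words[1])    # float() ignores surrounding whitespace such as '\n'
print(hitemp, lotemp)
```

Because float() tolerates the trailing newline, there is no need to strip the last piece before casting it.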

Now we're going to calculate the average high temperature and average low temperature. Prior to the start of your loop, initialize three variables -- count, hisum, and losum -- to zero. Inside the loop, increment count by 1, hisum by hitemp, and losum by lotemp. Remember, you can increment a variable by using the += notation. The two following expressions are equivalent: a = a + b and a += b.

After the loop, but still inside the main function, print out the average high temperature value (hisum/count) and the average low temperature value (losum/count).
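The running-sum pattern can be tried on its own with a few invented numbers. This sketch loops over a hard-coded list rather than a stream, and the names pairs, avg_hi, and avg_lo are hypothetical.

```python
# Invented (high, low) pairs standing in for the values parsed from the stream.
pairs = [(60.0, 40.0), (70.0, 50.0), (80.0, 60.0)]

count = 0
hisum = 0.0
losum = 0.0
for hitemp, lotemp in pairs:
    count += 1        # same as count = count + 1
    hisum += hitemp
    losum += lotemp

avg_hi = hisum / count
avg_lo = losum / count
print(avg_hi, avg_lo)
```

Keep in mind that if no lines reach your program, count stays at zero and the division raises a ZeroDivisionError.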

Formatted Printing in Python

Note that when you print out floating point numbers, the number of decimal places Python uses varies. Sometimes it prints out a lot, sometimes just a few. Python doesn't care about significant figures and doesn't worry about making things look nice. That's your job.

Fortunately, Python gives us an easy way to control how numbers are formatted when you print them to the Terminal or to a file. This is called formatted printing. The concept is to write out the string you want to print with placeholders for variables. The placeholders specify how the value is to be formatted.

Try the following example in your code when you print out the average high temperature.

print('Average Hi Temp: {0:f}'.format(hisum/count))

The curly brackets in {0:f} indicate that this is a placeholder for a variable. The f indicates that the value to be printed is a floating point value. The zero indicates that the variable to be printed is the first parameter passed into the str.format() method. (Remember: Python is zero-indexed, so index 0 refers to the first element.)

There are actually two function calls in this one line of code. First, we invoke the str.format() method belonging to the string 'Average Hi Temp: {0:f}', passing in the parameter hisum/count for formatting. Then, print() is used to output the resulting formatted string to the Terminal. Whenever functions are nested like this, the function calls are always performed from innermost (first) to outermost (last).
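To make the order of the two calls visible, this sketch splits the line into two steps. The value of avg is invented, and the variable message is only there for illustration.

```python
avg = 70.0  # invented value

# Step 1: the inner call builds the formatted string.
message = 'Average Hi Temp: {0:f}'.format(avg)

# Step 2: the outer call prints that string.
print(message)
```

With no precision given, the f format shows six digits after the decimal point.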

Here is another, equivalent way to format a print statement in Python:

print('Average Hi Temp: %(avgHi)f' % {'avgHi': hisum/count})

In this case, the %(avgHi) indicates that the string contains a value that should be printed, and that value is identified -- outside the string, after the % symbol, in the curly braces of a dictionary -- by the key avgHi. The value is formatted as a float by including f immediately after the parentheses. Note that avgHi is not the name of a variable in your program. This identifier could be anything, as long as it appears both in the %() marker within the string and in the curly braces following the string. For example, the following statements will also work:

print("Average Hi Temp: %(0)f" % {'0': hisum/count})

print("Average Hi Temp: %(#)f" % {'#': hisum/count})

As a rule of thumb, it's generally a good idea to be more descriptive than "#."

I tend to use the first method discussed in this section (curly braces within the string) to format printed values, so you will see this most often in future lab assignments. I find that this method helps me stay organized and avoid bugs. You are free to choose whichever method you prefer.

Add formatting to your average high and average low print statements, and test your code.

Note that regardless of which method you used to format the values as floating point numbers in your print statements, Python still displays them with lots of decimal places -- perhaps more than are useful. Fortunately, we can specify how many decimal places to use in our formatting expression. The following statements tell Python to use three decimal places.

print('Average Hi Temp: {0:.3f}'.format(hisum/count))

print('Average Hi Temp: %(avgHi).3f' % {'avgHi': hisum/count})

We can also tell Python to use a certain number of characters for the whole field by putting a number in front of the decimal point in our format string. This allows us to line up the decimal points on a column of numbers.

print('Average Hi Temp: {0:7.3f}'.format(hisum/count))

print('Average Hi Temp: %(avgHi)7.3f' % {'avgHi': hisum/count})

Try out the above statements, and test to see what happens to the output when you vary the two numbers in the formatting expression.
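As a starting point for that experiment, here is a small sketch that formats a few invented values with a width of 7 and 3 decimal places. The names values, rows, and same are hypothetical.

```python
values = [3.14159, 62.5, 104.0]  # invented values

# Width 7, 3 decimal places: shorter numbers are padded with spaces on the left.
rows = ['{0:7.3f}'.format(v) for v in values]
for row in rows:
    print(row)

# The % style with the same width and precision gives the same text.
same = all('{0:7.3f}'.format(v) == '%7.3f' % v for v in values)
```

Every row is seven characters wide, so the decimal points line up when the rows are printed one under another.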


When you are done with the lab exercises, you may start on the rest of the project.


© 2017 Caitrin Eaton.