Lab Exercise 2: Delving into grep and split
The goal of this week's project is to start converting data into knowledge. We'll answer questions like, how many sunny days were there in June and July at Great Pond? To do that, we have to start looking at the values of the data on each day and have the computer make decisions.
The purpose of this lab time is to give you some practice with the Unix tool grep and to examine how we can split a string into pieces using the Python string function split. These two capabilities will be necessary for the project.
Grep is a very useful tool for searching for patterns in data. So far, we have used it to search for a specific string. However, what if we want to search for more than one string, or strings that follow particular rules? In general, grep lets us search for basic regular expressions or extended regular expressions using the -E flag. To learn more about the format of basic regular expressions, type man re_format in the Terminal. To learn more about grep, type man grep in the Terminal.
Regular expressions are a powerful method of describing patterns. We're going to look at two specific capabilities. Consider the following file.
date,month,days
1/13/2014,cold,31
2/10/2014,colder,28
3/10/2014,cold,31
4/14/2014,wet,30
5/12/2014,muddy,31
6/9/2014,wet,30
7/14/2014,hot,31
8/11/2014,hotter,31
9/8/2014,cool,30
10/13/2014,cooler,31
11/10/2014,chilly,30
12/8/2014,cold,31
1/13/2015,colder,31
2/10/2015,colder,28
3/10/2105,cold,31
4/14/2015,chilly,30
5/12/2015,wet,31
6/9/2015,warm,30
7/14/2015,hot,31
8/11/2015,warm,31
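What if we wanted to find the lines corresponding to August and September of 2014? The pattern we want is something like '#/#/2014', where the first number is either an 8 or a 9, and the second number is any one- or two-digit value. A regular expression allows us to specify strings from a set of choices by using brackets. The expression [8,9] means that grep can match an 8 or a 9. (Strictly speaking, every character inside the brackets is a member of the set, so [8,9] also matches a literal comma; [89] is the tighter way to write the same choice. In this file no comma is ever followed by a slash, so the extra member does no harm.) So we can start our pattern with '[8,9]/', which means an 8 or a 9 followed by a forward slash.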
The second number consists of one or two digits, and they could be any digit. Rather than having to enumerate all of the digits, we can use [:digit:] to represent the set of all digits. However, there can be either one or two of them. We can specify that there are one or more digits by using the special combination \+ after a symbol or bracket expression. So you should read the expression '[[:digit:]]\+' as one or more digits. Therefore, we can extend our overall pattern to the following.
'[8,9]/[[:digit:]]\+/'
Note: You might be asking yourself, "Why [[:digit:]] instead of [:digit:]?" The short answer is "because grep requires it". The long answer is that [:digit:] is the "name" of the class of characters that are digits, and a class name only has meaning inside a bracket expression. The outer pair of brackets is that bracket expression: it matches one character from the set it lists. So if you want one character from the class [:digit:], you need to say [[:digit:]].
The final touch is to stick 2014 on the end. Download the dates.csv file and try out the following command. Note that you have to use the single quotes around the pattern so the bash shell doesn't try to interpret the string before passing it to grep.
grep '[8,9]/[[:digit:]]\+/2014' dates.csv
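Verify that your output is the following two lines (the August and September 2014 rows of the file shown above).
8/11/2014,hotter,31
9/8/2014,cool,30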
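If it helps to see the same idea outside of grep, here is a minimal sketch of the pattern in Python's re module. This is an illustration only and not part of the lab: the list lines is a hand-copied sample of rows from the file above, and the pattern is written as [89] (no comma) with [0-9] in place of [[:digit:]], because Python's re module does not support the [:digit:] class name.

```python
import re

# A few rows copied by hand from the file shown earlier in this lab.
lines = [
    "6/9/2014,wet,30",
    "7/14/2014,hot,31",
    "8/11/2014,hotter,31",
    "9/8/2014,cool,30",
    "12/8/2014,cold,31",
    "8/11/2015,warm,31",
]

# [89] is an 8 or a 9; [0-9]+ is one or more digits.
# Python's re module has no [[:digit:]], so [0-9] stands in for it.
pattern = re.compile(r'[89]/[0-9]+/2014')

# search() looks for the pattern anywhere in the line, as grep does.
matches = [line for line in lines if pattern.search(line)]

for line in matches:
    print(line)
```

Only the 8/11/2014 and 9/8/2014 rows survive. The 12/8/2014 row is rejected because the 8 there is the day, so it is not followed by another slash-separated number before the year.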
- paste
Imagine you have two data files that each have the same number of rows, but different data. You want to merge these two data files together, combining the first row from file 1 with the first row from file 2, and so on. The Unix command paste is the tool you want.
Download the file temps.csv. It contains the same number of lines as the dates.csv file, but shows the high and low temperature (F) for the contiguous United States on the corresponding dates.
Try using paste on the two files with no argument and see what it does. Note that, by default, it puts a tab in between the lines from each file. If we want it to insert a comma instead, we need to tell it to do so using the -d flag (see man paste for details).
paste -d ',' dates.csv temps.csv
One other useful thing we can do with the Unix shell is redirect output from the terminal to a file. The > symbol tells the shell to send anything going to stdout to the specified file instead. For example, the following command sends the output of paste to the file blend.csv.
paste -d ',' dates.csv temps.csv > blend.csv
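To make the row-by-row merge concrete, here is a rough Python sketch of what paste -d ',' does with two inputs of equal length. The rows in temps_lines (including the header) are made up for illustration, since the real contents of temps.csv are not shown here, and the names dates_lines, temps_lines, and blend_lines are hypothetical.

```python
# Made-up sample rows; the real data lives in dates.csv and temps.csv.
dates_lines = ["date,month,days", "1/13/2014,cold,31", "2/10/2014,colder,28"]
temps_lines = ["high,low", "40.0,20.0", "35.0,15.0"]

# zip pairs row 1 with row 1, row 2 with row 2, and so on.
# Joining each pair with a comma mimics the -d ',' delimiter.
blend_lines = [d + "," + t for d, t in zip(dates_lines, temps_lines)]

for line in blend_lines:
    print(line)
```

Note one difference from paste: zip stops at the end of the shorter input, so this sketch only mirrors paste when both inputs have the same number of rows.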
Verify that the contents of blend.csv make sense by opening the file in TextWrangler.
Create a new file in TextWrangler. Save it as temps.py. Copy the following template into your file. The template puts the body of your code inside a function main, and then calls the main function at the end after checking that the file was executed and not imported.
# Your Name
# Fall 2015
# CS 151S Project 2
#
# Command to run the program
#
# grep /2014 blend.csv | cut -f 4,5 -d ',' | python temps.py
#
#
import sys

def main(stdin):
    # main code here

if __name__ == "__main__":
    main(sys.stdin)
The goal of this task is to find the average high temp and average low temp for 2014. We can simplify our task by first using grep to find all lines with the string /2014, then use cut to extract fields 4 and 5. However, that means each line still contains two numbers. Test the first two components of the command above on the blend.csv file and see if it gives you a stream of numbers in two columns.
In order to separate the two numbers in the stream, we need a way to split a string into pieces inside Python.
Start by creating the overall loop that reads a line from the stream until it receives an empty line. The following code should go inside the main function.
# assign to buf the result of calling stdin.readline()
# while buf.strip() != '':
    #
    # Your other code will go here
    #
    # assign to buf the result of calling stdin.readline()
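One possible reading of that outline, as a sketch rather than the required solution: the helper name read_lines, the list collected, and the io.StringIO stand-in for sys.stdin are all assumptions made here so the example can run without a pipe. The print call is written with parentheses so that it runs under both Python 2 and Python 3.

```python
import io

def read_lines(stdin):
    # Keep reading until readline() hands back an empty (or blank) line.
    collected = []
    buf = stdin.readline()
    while buf.strip() != '':
        collected.append(buf.strip())   # "other code" would go here
        buf = stdin.readline()          # fetch the next line before looping
    return collected

# io.StringIO stands in for sys.stdin; the two lines are made-up data.
fake_stdin = io.StringIO(u"40.0,20.0\n35.0,15.0\n")
result = read_lines(fake_stdin)
print(result)
```

The second readline() call at the bottom of the loop matters: without it, buf never changes and the loop never ends.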
Put a print statement as the first thing in the while loop and print out buf. Then run your program. This shows you what is in the variable buf.
As the second thing in the while loop, assign to words the result of calling buf.split(','). Calling the split function of buf with a comma as an argument divides the string into pieces, splitting it on the commas. After assigning the split result to words, have your program print words on the next line. Test it and see what it prints out.
From the prior step, the variable words is what we call a list. Visually and syntactically, Python represents a list as square brackets with comma-separated elements. To access the elements of a list, we use what is called bracket notation. The first element of the list contained in words is words[0]. The second element of the list is words[1], and so on. Note that Python uses what is called zero-indexing, which means that the first element of a list has the index 0.
In your loop, after the assignment to words, assign to hitemp the result of casting words[0] to a float. Then assign to lotemp the result of casting words[1] to a float. Remove the other print statements in the loop and add a print statement that shows hitemp and lotemp. Test your code and make sure it prints out two columns of floating point numbers.
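Here is a small sketch of split, bracket notation, and casting on a single line. The string in buf is made up, shaped like one line of the output of cut.

```python
buf = "40.5,20.25\n"        # made-up line, shaped like the output of cut

words = buf.split(',')      # a list of two strings: '40.5' and '20.25\n'
hitemp = float(words[0])    # first element has index 0
lotemp = float(words[1])    # float() ignores the trailing newline

print(words)
print(hitemp)
print(lotemp)
```

Notice that split leaves the newline attached to the last piece. Casting to a float discards it, but it would still be there if you used words[1] as a string.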
Now we're going to calculate the average high temperature and average low temperature. Prior to the start of your loop, initialize three variables, count, hisum, and losum, to zero. Inside the loop, increment count by 1, hisum by hitemp, and losum by lotemp. Remember, you can increment a variable by using the += notation. The following expression is the same as a = a + b.
a += b
After the loop, but still inside the main function, print out the average high temperature value (hisum/count) and the average low temperature value (losum/count).
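Putting the pieces together, the running sums might look like the sketch below. The function name average_temps, the io.StringIO stand-in for sys.stdin, and the three data lines are assumptions made for illustration; your own code belongs in main and reads from the real pipe.

```python
import io

def average_temps(stdin):
    # Running totals start at zero before the loop.
    count = 0
    hisum = 0.0
    losum = 0.0

    buf = stdin.readline()
    while buf.strip() != '':
        words = buf.split(',')
        hisum += float(words[0])    # same as hisum = hisum + float(words[0])
        losum += float(words[1])
        count += 1
        buf = stdin.readline()

    # Divide the sums, not the last values read, by the number of lines.
    return hisum / count, losum / count

# Made-up input: highs 40, 50, 60 and lows 20, 30, 40.
fake_stdin = io.StringIO(u"40.0,20.0\n50.0,30.0\n60.0,40.0\n")
hiavg, loavg = average_temps(fake_stdin)
print(hiavg)
print(loavg)
```

This sketch assumes at least one line of input; with an empty stream count stays at zero and the division fails.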
- Formatted printing in Python
Note that when you print out floating point numbers, the number of decimal places Python uses varies. Sometimes it prints out a lot, sometimes just a few. Python doesn't care about significant figures and doesn't worry about making things look nice. That's your job.
Fortunately, Python gives us an easy way to control how numbers are formatted when you print them to the Terminal or to a file. This is called formatted printing. The concept is to write out the string you want to print with placeholders for variables. The placeholders specify how the value is to be formatted.
Try the following example in your code when you print out the average high temperature.
print "Average Hi Temp: %f" % (hisum/count)
The % sign indicates that this is a placeholder for a variable. The f character indicates that the value to be printed is a floating point value. Test out your code.
Note that we still get lots of decimal places, perhaps more than are useful, when Python prints the floating point number. Fortunately, we can specify how many decimal places to use in our format string. The following tells Python to use three decimal places.
print "Average Hi Temp: %.3f" % (hisum/count)
We can also tell Python to use a minimum number of characters for the whole field by putting a number in front of the decimal point in our format string. Shorter values are padded with spaces on the left, which allows us to line up the decimal points on a column of numbers.
print "Average Hi Temp: %7.3f" % (hisum/count)
print "Average Lo Temp: %7.3f" % (losum/count)
Try out the above statements and test out what varying the two numbers does to the format of the output.
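To compare the three format strings side by side, the sketch below builds each string first and then prints it. The value in average and the two numbers in the final loop are made up; print is written with parentheses so the example runs under both Python 2 and Python 3.

```python
average = 152.0 / 3     # made-up value, roughly 50.667

plain = "Average Hi Temp: %f" % average          # default: six decimal places
three_places = "Average Hi Temp: %.3f" % average # three decimal places
fixed_width = "Average Hi Temp: %7.3f" % average # padded to seven characters

print(plain)
print(three_places)
print(fixed_width)

# With a fixed width, values of different sizes line up on the decimal point.
column = ["%7.3f" % value for value in (50.667, 5.5)]
for row in column:
    print(row)
```

The width counts every character in the field, including the decimal point and the digits after it, so %7.3f leaves room for three characters before the decimal point.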
When you are done with the lab exercises, you may begin the project.