CS 152: Lab 2

Fall 2019

Lab Exercise 2: Searching and Splitting

The goal of this week's project is to start converting data into knowledge. We'll answer questions like: how many sunny days were there this past July at Great Pond? To do that, we have to start looking at the values of the data on each day and have the computer make decisions.

The purpose of this lab time is to give you some practice with the Unix tools grep, cut, and paste, which are handy tools for manipulating data files. In addition, we will examine how to split a string into pieces using the Python string function split. These two capabilities will be critical for the project.


Tasks

  1. Set up your workspace

    Mount the Personal fileserver, and make a folder called project2 in your directory. Then use the cd command in a Terminal to go to your new project2 folder.

    Command-K in the Finder is the shortcut to the fileserver connection window. Use the following fileserver address. It should be automatic on the lab computers.

    smb://filer.colby.edu

    Once mounted, create a project2 folder. You can do this in the Finder by using cmd-shift-N, or in the Terminal by using the cd command to move to your cs152 directory and then using the command

    mkdir project2

    to create a project2 folder.

    Open Terminal and change your working directory to the project2 folder. Change directories by typing cd and then the path to the directory. You can either type the path to your project2 directory in Terminal or you can use the Finder to drag and drop the path to your project2 folder into Terminal after you type cd and a space on the terminal command line.


  2. grep

    Grep is a very useful tool for searching for patterns in data. So far, we have used it to search for a specific string. However, what if we want to search for more than one string, or strings that follow particular rules? In general, grep lets us search for basic regular expressions or extended regular expressions using the -E flag. To learn more about the format of basic regular expressions, type man re_format in the Terminal. To learn more about grep, type man grep in the Terminal.

    Regular expressions are a powerful method of describing patterns. We're going to look at two specific capabilities. Consider the following file.

    date,month,days
    1/13/2014,cold,31
    2/10/2014,colder,28
    3/10/2014,cold,31
    4/14/2014,wet,30
    5/12/2014,muddy,31
    6/9/2014,wet,30
    7/14/2014,hot,31
    8/11/2014,hotter,31
    9/8/2014,cool,30
    10/13/2014,cooler,31
    11/10/2014,chilly,30
    12/8/2014,cold,31
    1/13/2015,colder,31
    2/10/2015,colder,28
    3/10/2105,cold,31
    4/14/2015,chilly,30
    5/12/2015,wet,31
    6/9/2015,warm,30
    7/14/2015,hot,31
    8/11/2015,warm,31
    9/8/2015,warm,30
    10/13/2015,cool,31
    11/10/2015,cool,30
    12/8/2015,cold,31

    1. Download the file dates.csv

      Right click on the link and use Save Link As... or Download Linked File As...

    2. A simple example

      Try the following use of grep that searches for any line containing the string "28".

      grep 28 dates.csv

      The result should be two lines of the file, both ending in 28.

      Try a second example and look for the pattern "11". In the listing above, four lines contain an 11.

    3. More complex patterns

      What if we want to find all the lines that start with a 3 in the date? If we search for the pattern "3/", it turns out we also get the lines where the day is 13, because "13/" contains "3/".

      grep 3/ dates.csv

      You can specify that you want the pattern to be at the start of the line by using the ^ character. For the standard grep usage, put a \ in front of the ^ so that the Terminal passes it to grep unchanged.

      grep \^3/ dates.csv

      The above line returns only lines that start with the pattern 3/
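
      The same anchoring idea carries over to Python's re module. The sketch below is only an illustration and is not required for the lab: the rows list is a hand-copied subset of the dates.csv listing, and lines_matching is a hypothetical helper name.

```python
import re

# A few rows copied by hand from the dates.csv listing above.
rows = [
    "1/13/2014,cold,31",
    "3/10/2014,cold,31",
    "10/13/2014,cooler,31",
    "12/8/2014,cold,31",
]


def lines_matching(pattern, lines):
    """Return the lines in which the regular expression is found anywhere."""
    return [line for line in lines if re.search(pattern, line)]


# Unanchored: "3/" also matches the 13 in 1/13 and 10/13.
unanchored = lines_matching("3/", rows)

# Anchored: ^ matches only at the start of the line.
anchored = lines_matching("^3/", rows)

print(unanchored)
print(anchored)
```

      In Python the pattern is an ordinary string, so no backslash is needed in front of the ^.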

    4. Building a sophisticated pattern (optional)

      You can define patterns that allow multiple values in a pattern.

      What if we wanted to find the lines corresponding to August and September of 2014? The pattern we want is something like '#/#/2014', where the first number is either an 8 or a 9, and the second number is any one or two digit value. A regular expression allows us to specify single characters from a set of choices by using brackets. The expression [89] means that grep can match an 8 or a 9. Try the following pattern.

      grep '[89]' dates.csv

      You should get every line in the file that contains an 8 or 9. If you add a slash after the [89], then it will find lines that have an 8 or a 9 followed by a slash.

      grep '[89]/' dates.csv

      The second number consists of one or two digits, and they could be any digit. Rather than having to enumerate all of the digits, we can use [:digit:] to represent the set of all digits. Try the following pattern.

      grep '[89]/[:digit:]' dates.csv

      This pattern fails to find anything, because [:digit:] (which has to be interpreted as a single special character) works only if it is inside another pair of brackets, just like we put the 89 in brackets. Try the following.

      grep '[89]/[[:digit:]]' dates.csv

      This pattern still doesn't do what we want, as it will get dates where there is an 8 or 9 in the second field. Adding a second slash to the pattern should eliminate some of the lines we do not want. Try the following.

      grep '[89]/[[:digit:]]/' dates.csv

      This time, the problem is that the pattern is too strict. It grabs only the lines where there is a single digit between the slashes. However, there can be either one or two digits in the middle field. We can specify that there are one or more digits by using the special combination \+ after a symbol or bracket expression. So the expression '[[:digit:]]\+' specifies one or more digits. Therefore, we can extend our overall pattern to the following. Try it out.

      grep '[89]/[[:digit:]]\+/' dates.csv

      Note: this is the first attempt where you actually have to put quote marks around the expression. The reason is that this expression contains the backslash character \. The backslash character has a special meaning to the Terminal, so it does not pass the expression to grep unchanged. By putting quotes around the expression, we tell the Terminal to pass the expression to grep unchanged.

      The final touch is to stick 2014 on the end. This expression should give us August and September of 2014.

      grep '[89]/[[:digit:]]\+/2014' dates.csv

      Verify that your output is

      8/11/2014,hotter,31
      9/8/2014,cool,30
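
      For comparison, here is a rough Python version of the final pattern. It is a sketch under a few assumptions: the rows list is a hand-copied subset of the listing above, and because Python's re module has no [[:digit:]] class, [0-9] is used in its place.

```python
import re

# A hand-copied subset of the dates.csv listing.
rows = [
    "6/9/2014,wet,30",
    "8/11/2014,hotter,31",
    "9/8/2014,cool,30",
    "8/11/2015,warm,31",
    "9/8/2015,warm,30",
]

# [89] is one character from the set {8, 9}; [0-9]+ is one or more digits.
# Python's re module has no [[:digit:]] class, so [0-9] stands in for it.
pattern = re.compile(r"[89]/[0-9]+/2014")

matches = [row for row in rows if pattern.search(row)]
for row in matches:
    print(row)
```

      Note that Python writes "one or more" as a plain +, where basic grep patterns use \+.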

  3. cut

    The cut command is a useful tool for extracting specific fields from a data file, such as a CSV file. To use cut, you specify the delimiter character separating the fields, then provide a list of the fields to extract from the file.

    The dates.csv file has three fields: a date, a string, and a number. To extract just the first column of data, use the following command.

    cut -d "," -f 1 dates.csv

    Try extracting each of the other two fields by changing the number after the -f.

    To extract more than one field from the file, use the field indexes you want separated by commas. The following command should select just the date and string from the dates.csv file.

    cut -d "," -f 1,2 dates.csv

    You can string together multiple Unix commands by using the pipe symbol |. For example, you can cut out the 2nd and 3rd fields and then grep for "warm".

    cut -d "," -f 2,3 dates.csv | grep warm

    Note that grep is processing the output of the cut command, so you don't have to provide a file to it. Try some other combinations. Note that you can put grep before cut, if you wish.
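
    To see what cut is doing under the hood, the following Python sketch imitates the cut and grep pipeline above on a few rows. It is an illustration only: cut_fields is a hypothetical helper, the rows are copied by hand from the listing, and the split method it relies on is introduced later in this lab.

```python
# A hand-copied subset of the dates.csv listing.
rows = [
    "5/12/2015,wet,31",
    "6/9/2015,warm,30",
    "7/14/2015,hot,31",
    "8/11/2015,warm,31",
]


def cut_fields(line, field_numbers, delimiter=","):
    """Keep the requested fields of one line, numbered from 1 as cut does."""
    fields = line.split(delimiter)
    kept = [fields[n - 1] for n in field_numbers]
    return delimiter.join(kept)


# Rough equivalent of: cut -d "," -f 2,3 dates.csv | grep warm
selected = [cut_fields(row, [2, 3]) for row in rows]
warm = [line for line in selected if "warm" in line]

print(warm)
```

    The main difference to keep in mind is that cut numbers fields starting at 1, while Python lists start at index 0.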


  4. paste

    Imagine you have two data files that each have the same number of rows, but different data. You want to merge these two data files together, combining the first row from file 1 with the first row from file 2, and so on. The Unix command paste is the tool you want.

    Download the file temps.csv. It contains the same number of lines as the dates.csv file, but shows the high and low temperature (F) for the contiguous United States on the corresponding dates.

    Use paste on the two files with no argument and see what it does.

    paste dates.csv temps.csv

    Note that, by default, it puts a tab in between the lines from each file. If we want it to insert a comma instead, we need to tell it to do so using the -d flag (see man paste for details).

    paste -d ',' dates.csv temps.csv

    If you recall from the last project, you can redirect output from the terminal to a file. The > symbol tells the terminal to send anything going to stdout to the specific file. For example, the following command sends the output of the paste command to the file blend.csv.

    paste -d ',' dates.csv temps.csv > blend.csv

    Verify that the contents of blend.csv make sense. You can open the file, or you can use the command cat. The cat command dumps the contents of a file to the terminal window.

    cat blend.csv

    Remember, for any Unix command, you can find out more by typing man followed by the command.
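
    The row-by-row merge that paste performs can be pictured with a short Python sketch. Everything here is illustrative: paste_lines is a hypothetical helper, and the temperature values and header are made up, since the contents of temps.csv are not shown in this lab.

```python
# First three lines of dates.csv, copied from the listing above.
date_lines = ["date,month,days", "1/13/2014,cold,31", "2/10/2014,colder,28"]

# Made-up header and values: the real contents of temps.csv are not shown here.
temp_lines = ["high,low", "40.1,21.3", "38.7,18.9"]


def paste_lines(left, right, delimiter="\t"):
    """Join line i of left with line i of right, as paste does for two files.

    Unlike paste, zip stops at the shorter list instead of padding it.
    """
    return [a + delimiter + b for a, b in zip(left, right)]


# Rough equivalent of: paste -d ',' dates.csv temps.csv
blended = paste_lines(date_lines, temp_lines, delimiter=",")
for line in blended:
    print(line)
```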


  5. Reading a file

    The first task in any data analysis is reading the data from the data file. For this task write a program that prints the contents of the blend.csv file to the terminal.

    1. Create a new python file

      Create a new file in TextWrangler. Save it as hightemp.py. Copy the following template into your file. The template puts the body of your code inside a function main, and then calls the main function at the end after checking to see if the file was executed and not imported.

      Whenever you copy and paste code into a file, select all of the text (cmd-A) and then select Text::Entab to make the white space consistent.

      # Your Name
      # Fall 2019
      # CS 152 Project 2
      #
      
      # any required import statements here
      
      # main function here
      def main():
      
          # main code here
          pass  # placeholder so the empty template runs; replace it with your code
      
      
      # only execute main if this file was executed
      if __name__ == "__main__":
          main()

      The last two lines will be new this week. In all of our future coding, we will encapsulate all of our code in functions. The top-level or master function is often called main (but it does not have to be). By encapsulating all code in functions, it makes it easier to import existing code files into other files to re-use the functionality. However, if we want to run a file, we want the main function in that file to execute. The if-statement in the above code differentiates between whether a file was executed on the Terminal (command-line) or imported into another Python file. If it was imported, then we do not want the main function to automatically execute. The if-statement is true only when the file is executed directly, so if it is imported, the main function does not execute.
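
      As a small illustration of why the check is useful, consider the sketch below. The function fahrenheit_to_celsius is a hypothetical example and is not part of the project. If this file were imported by another file, that file could call fahrenheit_to_celsius without triggering the print in main.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def main():
    # This runs only when the file is executed directly.
    print(fahrenheit_to_celsius(212.0))


# __name__ is "__main__" when the file is executed from the Terminal,
# and the name of the module when the file is imported.
if __name__ == "__main__":
    main()
```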

    2. Open a file for reading

      Use the open function to open a file for reading. The first argument should be the name of the file, and the second argument should be the string "r", which tells Python to open the file for reading. Start the main function with the following line of code.

      fp = open( "blend.csv", "r" )

    3. Read one line of a file

      Use the readline method of the file object fp to read one line of the file. The readline method returns a string that is the whole line of the file. Add the following line of code.

      line = fp.readline()

      If your data file has a header, then you need to read another line to get to the first line of actual data. Duplicate the above line if your data file has a header row.

    4. Read the whole file using a while loop

      To read the whole file, we need to keep reading until the readline function returns an empty string. A simple way to repeat a set of commands is to use the Python keyword while followed by the condition under which the loop should keep executing. In this case, the condition should be that the loop should run while the line returned by fp.readline() has a length greater than zero.

      The Python function that returns the length of a sequential object is len(). The expression len(line) returns the number of characters in the line variable.

      Combining the while keyword with the condition len(line) > 0, the following defines a while loop and begins the block of code to be repeated.

      while len(line) > 0:

    5. Print the lines of the file to the terminal

      In the body of the loop, print the contents of the variable line. Remember to tab in the code that forms the body of the loop.

      Get the next line of the file by assigning to the line variable the result of calling fp.readline(). Now the loop should continue until readline returns an empty string, which happens at the end of the file.

    6. Close the file

      After the while loop and outside of it, use the close method of the file to clean up. Make sure it is tabbed to the same place as the while statement.

      fp.close()

    7. Test your code

      Run your code and see if it prints the contents of the blend.csv file to the Terminal.
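
      If you want to check your understanding of the loop structure, here is one possible sketch of the same read pattern. So that it runs on its own, it writes a tiny stand-in file named sample.csv to a temporary folder instead of using blend.csv; read_all_lines is a hypothetical helper name, and your own hightemp.py does not need to be organized this way.

```python
import os
import tempfile


def read_all_lines(filename):
    """Read a file one line at a time with readline and a while loop."""
    lines = []
    fp = open(filename, "r")
    line = fp.readline()
    while len(line) > 0:        # readline returns "" only at the end of the file
        lines.append(line)
        line = fp.readline()    # get the next line, or the loop never ends
    fp.close()
    return lines


def main():
    # Write a tiny stand-in file so the sketch does not depend on blend.csv.
    path = os.path.join(tempfile.mkdtemp(), "sample.csv")
    fp = open(path, "w")
    fp.write("date,month,days\n1/13/2014,cold,31\n")
    fp.close()

    for line in read_all_lines(path):
        print(line, end="")     # each line already ends with a newline


if __name__ == "__main__":
    main()
```

      Each string returned by readline keeps its trailing newline, which is why a plain print(line) shows a blank line between rows.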


  6. split

    The prior task shows that we can access each line of the file blend.csv. What if we want to find the date and value of the highest temperature? To find the highest temperature we have to extract the fourth field from each line, convert it to a floating point value, and then keep track of the highest value. If we want the corresponding date, we have to keep track of it as well.

    Python strings have a useful method called split that divides a string based on a character or pattern. In this case, the fields of the line are separated by commas. Modify the while loop in the main function from the prior task. After the print statement, but before the call to fp.readline(), assign to words the result of calling line.split(","). The argument to split tells it what character to use to divide the string.

    After the split statement, print the value of words. Then run your program and see the result. The individual fields of each line should be a sequence of strings separated by commas, with the whole sequence surrounded by square brackets.

    The variable words is what we call a list. Visually and syntactically, Python represents a list as square brackets with comma-separated elements. To access the elements of a list, use what is called bracket notation. The first element of the list contained in words is words[0]. The second element of the list is words[1], and so on. Note that Python uses what is called zero-indexing, which means that the first element of a list has the index 0.

    In your loop, after the assignment to words, assign to temp the result of casting words[3] to a float. Then assign to date the first entry in the list: words[0]. Now your code has access to each high temperature value in the blend.csv file and its corresponding date.

    If you run into an error where Python says it can't convert hitemp to a float, then you need to add a duplicate of the code

    line = fp.readline()

    before the while loop in order to skip the header line of your data file.
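
    The following short sketch shows split, bracket notation, and the cast to float on a single line. The line itself is made up; it only assumes that blend.csv has the date first and the high temperature in the fourth field.

```python
# Made-up line in the layout blend.csv is expected to have:
# date, description, days in month, high temperature, low temperature.
line = "7/14/2014,hot,31,88.5,61.2\n"

# split returns a list of strings, one per comma-separated field.
words = line.split(",")
print(words)

date = words[0]           # first field, index 0
temp = float(words[3])    # fourth field, index 3, converted to a number
print(date, temp)
```

    Notice that the last element of words still carries the newline from the end of the line. The float function ignores surrounding whitespace, so converting that last field would still work.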


  7. Keeping track of the high temperature

    How do you keep track of the highest value in a list of numbers being read to you? The first number will be the highest you have seen so far. After that, you have to compare each new number to the highest value you have heard so far, updating your highest value if a higher one comes along. We want to have our code do the same thing.

    1. Create a variable to keep track of the highest temperature seen so far. At the start of your main function, before the while loop, assign to hitemp a large negative value.
    2. Create a variable to keep track of the date of the highest temperature. After assigning hitemp, assign to hidate the empty string.
    3. Inside the loop, after grabbing the current temperature and date, check if the value of temp is greater than the value of hitemp. Use an if statement. If temp is greater than hitemp, then assign to hitemp the value of temp and assign to hidate the value of date.
    4. After the while loop, print out the values of hitemp and hidate. Look at the blend.csv file. Is your answer correct?

    (You may want to remove some of the print statements inside the loop to make the output cleaner.)
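
    The steps above can be sketched on a small list instead of a file. This is an illustration under stated assumptions, not the required solution: the readings are made up, find_highest is a hypothetical helper, and it uses a for loop over a list where your program uses a while loop over the lines of blend.csv.

```python
# Made-up (date, high temperature) pairs standing in for the rows of blend.csv.
readings = [("7/14/2014", 88.5), ("8/11/2014", 91.0), ("9/8/2014", 79.25)]


def find_highest(pairs):
    """Return the highest temperature and its date from (date, temp) pairs."""
    hitemp = -1.0e9    # a large negative value, lower than any real reading
    hidate = ""
    for date, temp in pairs:
        if temp > hitemp:
            # A new highest value: remember it and its date together.
            hitemp = temp
            hidate = date
    return hitemp, hidate


hitemp, hidate = find_highest(readings)
print(hitemp, hidate)
```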


  8. Formatted printing in Python (optional)

    Note that when you print out floating point numbers, the number of decimal places Python uses varies. Sometimes it prints out a lot, sometimes just a few. Python doesn't care about significant figures and doesn't worry about making things look nice. That's your job.

    Fortunately, Python gives us an easy way to control how numbers are formatted when you print them to the Terminal or to a file. This is called formatted printing. The concept is to write out the string you want to print with placeholders for variables. The placeholders specify how the value is to be formatted.

    Try the following example in your code when you print out the highest temperature.

    print("Highest Temp: %f" % (hitemp))

    The % sign indicates that this is a placeholder for a variable. The f character indicates that the value to be printed is a floating point value. Test out your code.

    Note that there are still lots of decimal places, perhaps more than are useful, when Python prints the floating point number. Fortunately, we can specify how many decimal places to use in our format string. The following tells Python to use three decimal places.

    print("Highest Temp: %.3f" % (hitemp))

    We can also tell Python to use a certain number of characters for the whole field by putting a number in front of the decimal point in our format string. This allows us to line up the decimal points on a column of numbers.

    print("Highest Temp: %7.3f" % (hitemp))

    It is possible to have multiple format expressions in a single string. The format code %s is a placeholder for a string. The following prints both the date and temperature.

    print("The highest temperature of %.3f occurred on %s" % (hitemp, hidate))

    Try out the above statements and test out what varying the two numbers does to the format of the output.
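
    To compare the format codes side by side, the sketch below builds each string and prints it. The values of hitemp and hidate are made up for the illustration.

```python
# Made-up values for illustration only.
hitemp = 88.5
hidate = "7/14/2014"

plain = "Highest Temp: %f" % (hitemp)        # default: six decimal places
three = "Highest Temp: %.3f" % (hitemp)      # three decimal places
padded = "Highest Temp: %7.3f" % (hitemp)    # at least seven characters wide
both = "The highest temperature of %.3f occurred on %s" % (hitemp, hidate)

print(plain)
print(three)
print(padded)
print(both)
```

    With a width of 7, the six characters of 88.500 are padded with one extra space on the left, which is what lets a column of numbers line up.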


When you are done with the lab exercises, you may begin the project.