cut: a brief overview

The Unix cut command is an extremely useful tool for filtering information. If we consider an input to be one line of a file or stream, cut can return a piece of that input. It could, for example, return the third character from each line. If the input is many lines, cut will return a piece of each line. The power of cut comes from our ability to specify the rule for extracting the piece of each line that cut returns.

Consider the following two lines of input.

05/03/2015 12:00:00 PM,12.4,9.33,1.7,7.33,101.7,12.34,4.79,99.7,12.90,8.01
05/03/2015 1:00:00 PM,12.3,9.50,1.5,7.31,103.3,12.53,4.94,99.8,12.86,7.97

If we want to extract the four characters giving the year, these are the characters at positions 7, 8, 9, and 10. Position numbering starts at 1 when using the cut tool. So we could give the following cut command to extract those four characters from each line.

cut -c 7-10

If you want to try this out, copy the two lines of input above into a file called test.txt, and then tell cut to use that file as input. You can also download this file.

cut -c 7-10 test.txt

The output should be two lines of 2015.

2015
2015
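When no file name is given, as in the first command above, cut reads from standard input, so it can also sit at the end of a pipeline. The following is a minimal sketch, assuming a POSIX shell with printf available; the input is a shortened copy of the first sample line, and the second command is only an illustration of giving -c a comma-separated list of ranges.

```shell
# Feed one sample line to cut on standard input (no file argument).
printf '%s\n' '05/03/2015 12:00:00 PM,12.4,9.33' | cut -c 7-10

# A list of ranges also works: characters 1-2 and 4-5, skipping the slash at 3.
printf '%s\n' '05/03/2015 12:00:00 PM,12.4,9.33' | cut -c 1-2,4-5
```

The first command should print 2015 and the second 0503.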

The problem with using specific positions in a string is that, in something like a data file, the same information may not always be at the same position. In the case above, anything after the hour will have a different position depending on whether the hour has one digit or two. The AM/PM information, for example, will be at positions 20 and 21 when the hour has one digit, or at positions 21 and 22 when it has two.
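To see the problem concretely, the sketch below asks for the same two character positions on both sample lines. It assumes a POSIX shell and recreates test.txt in the current directory, overwriting any existing file of that name.

```shell
# Recreate the two sample lines (this overwrites test.txt if it already exists).
cat > test.txt <<'EOF'
05/03/2015 12:00:00 PM,12.4,9.33,1.7,7.33,101.7,12.34,4.79,99.7,12.90,8.01
05/03/2015 1:00:00 PM,12.3,9.50,1.5,7.31,103.3,12.53,4.94,99.8,12.86,7.97
EOF

# Positions 21-22 hold "PM" only on the line whose hour has two digits.
cut -c 21-22 test.txt
```

The first line should give PM, but the second should give M followed by a comma, because its one-digit hour shifts everything after it one position to the left.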

Fortunately, cut allows us to specify fields instead of character positions. A field is a group of characters separated from other characters by a specific delimiter, such as a space or a comma. For example, if we wanted to specify the date and time field, we could say it was the first field as delimited by a comma.

cut -f 1 -d ',' test.txt

The output should be two lines with the date and time field.

05/03/2015 12:00:00 PM
05/03/2015 1:00:00 PM

As another example, we can get the 1m temperature field, field 11, by just changing the field number given to the -f option.

cut -f 11 -d ',' test.txt

The output should be two numbers, the last number in each row.

8.01
7.97
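The -f option is not limited to a single field number; it also accepts comma-separated lists and ranges. The following is a sketch under the same assumptions as before (a POSIX shell, with test.txt recreated in the current directory); the particular fields chosen are only for illustration.

```shell
# Recreate the two sample lines (this overwrites test.txt if it already exists).
cat > test.txt <<'EOF'
05/03/2015 12:00:00 PM,12.4,9.33,1.7,7.33,101.7,12.34,4.79,99.7,12.90,8.01
05/03/2015 1:00:00 PM,12.3,9.50,1.5,7.31,103.3,12.53,4.94,99.8,12.86,7.97
EOF

# A list of fields: the date and time (field 1) together with field 11.
cut -f 1,11 -d ',' test.txt

# A range of fields: fields 2 through 4.
cut -f 2-4 -d ',' test.txt
```

When more than one field is selected, cut joins them in the output with the same delimiter, so the first command should print each date and time followed by a comma and the last number in the row.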

It's also possible to use cut again on its own output. Consider, for example, the task of extracting the AM/PM field from the date and time. Those characters are not always in the same place, so we can't use specific character locations. They are separated from the rest of the date and time field by spaces, but they are separated from the other information by commas.

To solve this problem, we can break it down into two steps. First, isolate the first field from the rest of the line using commas as the delimiter. Second, isolate the AM/PM field from the rest of the date/time field using spaces as the delimiter. We can use the Unix pipe (the | character) to send the output of one call to cut to a second call to cut.

cut -f 1 -d ',' test.txt | cut -f 3 -d ' '

The output should be two lines of PM.

PM
PM
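The same approach extends to more than two steps. As a sketch, assuming a POSIX shell and test.txt recreated in the current directory, the pipeline below pulls out just the hour by cutting on commas, then on spaces, then on colons.

```shell
# Recreate the two sample lines (this overwrites test.txt if it already exists).
cat > test.txt <<'EOF'
05/03/2015 12:00:00 PM,12.4,9.33,1.7,7.33,101.7,12.34,4.79,99.7,12.90,8.01
05/03/2015 1:00:00 PM,12.3,9.50,1.5,7.31,103.3,12.53,4.94,99.8,12.86,7.97
EOF

# Step 1: keep the date and time (comma-delimited field 1).
# Step 2: keep the time (space-delimited field 2).
# Step 3: keep the hour (colon-delimited field 1).
cut -f 1 -d ',' test.txt | cut -f 2 -d ' ' | cut -f 1 -d ':'
```

The output should be 12 and 1, one per line, regardless of how many digits the hour has.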

This is another example of how we can take a more complex problem and break it down into steps that can be solved by tools we already have.