LUX 211 - Shell Programming

Lesson 8: Chapter 14, AWK (a little more)

Objectives:

This lesson continues the discussion of the use of AWK. Objectives important to this lesson:

  1. Math operators
  2. Control structures
  3. Builtin variables
  4. Using BEGIN and END
  5. Running an awk program from a file

Chapter 14, continued

What else can we look into about awk?

We should revisit the structure of an awk program briefly. You should know that the command to run an awk program begins with awk (or gawk...) to invoke the awk interpreter. Without this command, the shell would not understand the program. The typical command continues with a pattern to match, an action (the program) to run on each matching line, and a file (or files) from which to get the data.

Not all programs are short. Long ones are typically saved in their own files. Most of a program is run against every match, but our text explains that programs can include two special sections that are run separately. The BEGIN section is run before any data lines are processed, and the END section is run after all of the data lines have been processed. The section between BEGIN and END does not have a name, but if it had one, that name would be MAIN. This is the section of code that is run for every data line. This gives you the opportunity to write three programs that can run consecutively.
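To make the three sections concrete, here is a minimal sketch. The three sample lines and the variable names report and count are made up for illustration; they are not from the text.

```shell
# Hypothetical sample data: three lines fed to awk on standard input.
report=$(printf 'alpha\nbeta\ngamma\n' | awk '
BEGIN { print "Starting" }             # runs once, before any input is read
      { count++ }                      # runs once for every input line
END   { print "Lines seen:", count }   # runs once, after the last line
')
echo "$report"
```

The BEGIN message appears first, the unnamed middle section runs three times without printing anything, and the END section reports the count.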

Let's take a look at some of the math operators on page 625. Some will look unfamiliar:

  • **
    The double star operator is used with exponents in gawk. Whatever is on its left is raised to the power of whatever is on its right. This operator does not work in some other awk versions. Be aware that you can use the caret (^) operator instead. On page 628, the author gives us two examples of running some code in the BEGIN section of his program. It is too bad that the code he wrote could never produce the output printed in the text.
  • %
    The percent sign is the modulus operator. An illustration will help you understand it:
    27 % 5
    means put 5 into 27 as many times as it will go, then report the remainder, the part that is left over, which is 2 in this case. This is good for converting measurements and money. For instance, you can take a number then do a modulus operation on it with various values of currency to figure out how to make change.
  • ++ and --
    We have already seen the incrementation and decrementation operators. The definitions on that page are a little off. Remember that they can be used as prefix or postfix operators.
  • += -= *= /= %=
    Consider the first of these operators in an example: total+=25 means to add 25 to the current value of the variable called total, then assign the answer to that variable. Each of the operators of this type means to perform a similar process, just using the particular math process defined by its first character. You can have a variable or an expression on the right side of these operators, but you must have a variable on the left, since that is the operand that will be redefined by the operation.
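Here is a small sketch of the operators above at work, using the change-making idea. The amount (68 cents) and the variable names are assumptions chosen for illustration. The exponent example uses the caret because it works in awk versions that do not accept the double star.

```shell
change=$(awk 'BEGIN {
    cents = 68                      # hypothetical amount of change owed
    quarters = int(cents / 25)      # how many times 25 goes into 68
    cents %= 25                     # keep only the remainder
    dimes = int(cents / 10)
    cents %= 10
    nickels = int(cents / 5)
    cents %= 5                      # whatever is left is pennies
    print quarters, dimes, nickels, cents
    print 27 % 5                    # the remainder example from above
    print 2 ^ 10                    # caret exponent
    total = 100
    total += 25                     # same as total = total + 25
    print total
}')
echo "$change"
```

Everything runs in a BEGIN section, so awk does not need any input data for this one.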

There is a discussion of loops and other control structures on pages 627 through 628. Some of the characteristics of control structures in awk are the same as the ones you already know in the shell, but there are some awk differences:

  • In an if structure (or a loop), the conditional test goes inside parentheses, not square brackets. There is no then keyword in awk.
  • Simple, one-line instructions in such structures do not need to be enclosed by curly braces. Curly braces are needed when the body of an if or a loop contains more than one statement.
  • The for loop control structure we discussed in class last time is the best way to use a for loop. Here is an example: for (x=1; x<=10; x++). Note that awk uses the <= operator here, not the shell's -le. This form places initialization, conditional testing, and incrementation in one location, making maintenance easier and more elegant.
  • When printing in awk, remember that $n arguments are positional variables that stand for fields in records, and that awk thinks every line of text is a record.
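The points in this list can be sketched in one short command. The two data lines and the cutoff of 10 are invented for the example.

```shell
orders=$(printf '4 apples\n12 pears\n' | awk '{
    # The test goes in parentheses; a single statement needs no braces.
    if ($1 > 10) print $2, "is a big order"
    else         print $2, "is a small order"
}
END {
    # Initialization, test, and increment all sit in one place.
    for (x = 1; x <= 3; x++) print "pass", x
}')
echo "$orders"
```

Here $1 and $2 are the first and second fields of each record, so the if test runs once per line, while the for loop runs only once, in the END section.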

Last time I mentioned that awk has several builtin variables, which can be redefined, but they are more useful if you don't redefine them. Take a look at the builtin variable article at thegeekstuff.com. I was thinking about an exercise in which you would process each field separately, but do it in a loop. That should be no trouble at all if you use the NF variable, which holds the number of fields in the current record. This would allow the loop to customize its cutoff value for each record, while using the same code for every record. That should allow you to process any record/line in any file, even when one record has a different number of fields than the one before it. This also suggests an approach you should take in programming: simple, effective, flexible, reusable code is better than something that has to be rewritten for every new collection of data.
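One way that exercise might look is sketched below. The sample records are made up, and they have different numbers of fields on purpose.

```shell
fields=$(printf 'a b c\nd e\n' | awk '{
    # NF is reset for every record, so the loop stops at the right place.
    for (i = 1; i <= NF; i++)
        print "record " NR ", field " i ": " $i
}')
echo "$fields"
```

Inside the loop, $i means "the field whose number is stored in i", so the same line of code prints every field of every record.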

We are in need of some more material about awk, so I have a gift for you. If you follow this link, you will find a pdf of a text about awk, made available by gnu.org. I am also placing a copy of it in the items for this lesson on Blackboard. Awk is worth more than one chapter in a book, so here is a copy of a good book about it.

Some of the examples in the text are a bit complex, so let's consider a few more instructive ones. Assume we have a table of data that holds numbers in field 6. (I mention that this is a table to imply a reliable shape to the file.) Can you write an awk command to compute the sum of all the values in that column? The only problem is one of housekeeping.

This problem requires that we use an accumulator, a variable to which we will add each of the items in column 6. Let's call it total. Ideally, we want to start with an accumulator equal to 0, but if we include total=0 in the body of the awk command, we will reset it for each row, which we don't want to do. If this were an ordinary shell script, we would just initialize the variable outside the loop. Luckily, the awk language lets us use a BEGIN section that runs once, before the lines in the data are processed. The command could look like this:

awk 'BEGIN {total=0} {total=total+$6} END {print "Total of column 6:", total}'

I added an END section as well, which is processed only one time, to output the answer. You might have printed the running value of total as each line was processed instead, maybe printing the line number (NR) along with it. Note that it is NR, not $NR: a dollar sign in front would ask for the field whose number is stored in NR. You may be thinking that I have a loop that never loops here, but that is not so. The nature of an awk program is that the body is processed once for each record, or, if there is a pattern, once for each record that matches the pattern.
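A sketch of the running-total variation follows. The two six-field data lines are hypothetical, standing in for the table described above.

```shell
running=$(printf 'a b c d e 10\na b c d e 5\n' | awk '
      { total += $6; print NR, total }          # line number, running total
END   { print "Total of column 6:", total }
')
echo "$running"
```

The body runs once per record without any explicit loop, which is why the running total grows from line to line.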

If I use the command above on the command line, it will work, but if I put it in a script, it may not. Putting awk commands in a script can be tricky, but a recommended method is to use a modified hashbang in your script. Try using this as your first line, modifying the path to lead to the awk command and adding -f to tell awk to read its program from a file (the one you are in).

#!/bin/awk -f
the awk program goes here.

The -f switch in the hashbang command tells awk that it will be reading a file for its program instructions. You do not need to put a control-d in the file: control-d only signals end of input when you are typing at a terminal, and awk recognizes the end of a program file on its own. Since awk is being invoked in the file that contains the script, the command line to call that script only needs to contain the script's name (plus the names of any data files), provided the script file has been made executable.

The method above should work for files that contain only an awk program. If there were other code in the file, you would not want to call awk first. What does that lead us to? If we want to use awk for a purpose in the midst of a shell script, call the awk program as a separate process, from which control can return to your shell script. Keeping the awk commands in separate scripts is a little fussy if there is only one command to run, but is much cleaner if you have a script of any length.
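Here is a sketch of a shell script that keeps its awk program in a separate file and calls it as a separate process. The file names sum6.awk and table.txt and the sample data are assumptions made for the example. The script runs the program with awk -f so that it works wherever awk is installed. If the hashbang path matches your system and the file is executable, you could run the program file by name instead.

```shell
workdir=$(mktemp -d)

# The awk program lives in its own file (hypothetical name: sum6.awk).
cat > "$workdir/sum6.awk" <<'EOF'
#!/bin/awk -f
# Total the values in column 6.
BEGIN { total = 0 }
      { total += $6 }
END   { print "Total of column 6:", total }
EOF

# Hypothetical sample table with numbers in field 6.
printf 'a b c d e 10\na b c d e 5\n' > "$workdir/table.txt"

# awk runs as a separate process; control returns to the shell afterward.
# To awk, the hashbang line is just a comment.
sum_output=$(awk -f "$workdir/sum6.awk" "$workdir/table.txt")
echo "$sum_output"

rm -r "$workdir"
```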

Regarding scope, calling an awk script from the command line or from another script starts awk as a new child process, which means the awk program has no direct access to the variables in the script that calls it unless you pass them in explicitly (for example, with awk's -v option). It also means that you cannot simply set a variable in awk and read it back in the calling script. The usual way to get a value back is to capture what awk prints. Avoid starting a new shell in an awk script, and avoid relying on variables in awk that you will need to use in the script that called it.
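A minimal sketch of moving values across that boundary is shown below. The names limit, max, and count and the sample numbers are made up. The -v option hands a shell value to awk, and command substitution captures what awk prints.

```shell
limit=10    # a shell variable; awk cannot see it on its own

# -v copies the shell value into an awk variable named max.
# $( ... ) captures whatever awk prints, which is how the answer comes back.
count=$(printf '4\n12\n30\n' | awk -v max="$limit" '
$1 > max { n++ }
END      { print n + 0 }    # n + 0 prints 0 if nothing matched
')

echo "Values above $limit: $count"
```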