LUX 211 - Shell Programming

Lesson 7: Chapter 14, AWK


This lesson discusses the use of AWK. Objectives important to this lesson:

  1. The AWK language
  2. Syntax
  3. Arguments
  4. Options
  5. Patterns
  6. Actions
  7. Variables
  8. Functions
  9. Control structures

Chapter 14

AWK is named for its authors, Alfred Aho, Peter Weinberger, and Brian Kernighan. It comes with UNIX and Linux. We can think of it as a language or as a command with lots of syntax and options. The text jumps from AWK to gawk without any explanation on page 620. The Wikipedia page behind the link in the last sentence informs us that gawk is a newer, GNU version of AWK. The text finally mentions that gawk is GNU awk, and that there is also a version called mawk, developed by Mike Brennan. (So he added his first initial.) This tutorial mentions other varieties, some long gone and some still around.

What is it good for? Lots of files in the UNIX/Linux world used to be text files, especially files that held data tables. AWK likes tables better than plain text, and can't really do anything with formatted text, like docx files. Log and event files created by processes are typically still text files. Output from one process to another usually can be treated as a text file. AWK was created to search through text files, and to return fields, records, or lines of data that match search criteria.

Two generic examples of gawk commands appear on page 620:

gawk [options] [program] [file list]
gawk [options] -f program-file [file list]

They are not clear, so we need some examples. Instead of providing enlightening examples, the author begins a short tour of options.

  • In both templates above, the command starts with gawk, and like most commands, it is followed by any options you want to use.
  • At this moment, you have no idea what options you could even want to use, so let's assume none.
  • The next thing to care about on each line is the idea of a program. Awk is a programming language, so this third element is either a program that you enter on the command line, or the -f switch and the name of a file that contains the program you want to run. If you write the program on the command line, the text tells us that you must enclose the commands in a set of single quotes. The quotes keep the shell from interpreting characters such as $ and the curly braces, so the whole program reaches gawk unchanged as a single argument.
  • This makes the fourth (fifth?) item on the command line puzzling until you know that file list refers to a list of files that hold the data you wish to search/process.
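
As a minimal sketch of the two templates above, the commands below run the same one-line program twice: once typed on the command line, and once from a program file named with -f. The file names people.txt and show.awk and the data are made up for this illustration. Plain awk is used so the lines should run on most systems; on the class VM you can type gawk instead.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny data file (hypothetical contents): one record per line.
printf 'Ada Lovelace\nAlan Turing\n' > people.txt

# Form 1: the program is typed on the command line, inside single quotes.
inline_out=$(awk '{ print $2 }' people.txt)

# Form 2: the same program is stored in a file and named with -f.
printf '{ print $2 }\n' > show.awk
file_out=$(awk -f show.awk people.txt)

echo "$inline_out"
echo "$file_out"
```

Both forms should print the same two last names, which shows that the file list means the same thing in either template: it is the data, not the program.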

On page 621, the text expands on the options you might want to use, providing us with a list of available options. First, the long form of each option is preceded by a double hyphen, and works only if you are using gawk. If you are not using gawk, use the options in the list that start with a single hyphen; gawk accepts those short forms as well. This means that the notation is different for gawk, and that the names of the options may be different as well. Note that four of the items in the short list are not available except in gawk. I think the author likes gawk.
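
Here is a small sketch of the two option styles, using -F (field separator) and -v (assign a variable), which awk, mawk, and gawk all accept. gawk also accepts the long spellings --field-separator and --assign; those lines only run if gawk is installed. The sample record and the variable name label are invented for the example.

```shell
# Short options: accepted by awk, mawk, and gawk alike.
short_out=$(printf 'Smith,Pat,MI\n' |
    awk -F, -v label='Name:' '{ print label, $2, $1 }')
echo "$short_out"

# Long options: gawk only, so check for gawk before trying them.
if command -v gawk >/dev/null 2>&1; then
    long_out=$(printf 'Smith,Pat,MI\n' |
        gawk --field-separator=, --assign=label='Name:' '{ print label, $2, $1 }')
    echo "$long_out"
fi
```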

On page 622, the text begins a section on the gawk language.

  • A gawk program usually has two parts: a pattern (something to search for) and an action (something to do when a match to the pattern is found).
  • gawk is concerned with lines of text in files. It searches for patterns in lines, and takes action on those lines.
  • A gawk program can run without having a pattern. If there is no pattern, all lines in the files in the file list are processed by the action.
  • The action part of a gawk program is enclosed in curly braces to separate it from the pattern. If the program has no action portion, the default action is to copy selected lines to standard output.
  • A program is allowed to have several patterns, each with its own associated action. If a line matches more than one pattern, the matching actions are taken in the order they appear in the program.
  • A gawk program may have no pattern, or it may have no action, but it must have one or the other.
  • A gawk/awk program can use the keywords BEGIN and END to mark commands that are run only when the program starts and ends. This may seem odd, but the idea can be used to print a header and a footer for a report, while the rest of the code populates the report with data from each qualifying record.
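
The bullets above can be sketched in one short program. It has a BEGIN block, a pattern with an action, a pattern with no action (so matching lines are simply copied to standard output), and an END block. The data and the report wording are invented for the illustration.

```shell
report=$(printf 'apple 3\nbanana 12\ncherry 7\n' | awk '
BEGIN      { print "Fruit report" }        # runs once, before any input is read
$2 > 5     { big++ }                       # pattern + action: count rows over 5
/^b/                                       # pattern only: default action prints the line
END        { print big " rows over 5" }    # runs once, after the last line
')
echo "$report"
```

The banana line matches both the second and third rules, so both take effect for that record, in the order they appear in the program.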

To illustrate what a gawk program is supposed to do, download the csv file that accompanies this week's lesson on Blackboard. For those who cannot access Blackboard, this file holds several records, each of which is about a student in this class. The first field is last name, the second field is first name, and the last field is state. This is a comma delimited file, a type which can be read and written by Excel. We can use it for an exercise.

  1. In your VM for this class, open a browser, navigate to the Blackboard page for this lesson, and download the CSV file to your home directory.
  2. Use cat on the file to see what it looks like.
  3. Now use gawk to do the same thing. On a command line enter the following:

    gawk -F, '{print $1, $2, $3}' LUX211just3col.csv

    In the command above, we are not using a pattern, so all lines of the file will be selected.
    The -F is an option to specify the field delimiter. (Yes, the F has to be a capital, because a lower case f means something else to gawk/awk.)
    For this file, the delimiter is a comma, so a comma appears immediately after the option switch.
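    gawk uses $1, $2, and so on to refer to fields in the records. This program uses all three, in the order they occur in the file, to approximate the cat command. The match is not exact: the commas between the arguments to print tell gawk to put the output field separator between the fields, and by default that is a space, so the output shows spaces where the file has commas.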
  4. Now, do it again, but change the order of the fields or leave one out. That is more flexible than cat.
  5. In this step, we want to use a pattern to select only particular records. This pattern is more of a rule.

    gawk -F, '$3 ~ "MI" {print $1, $2}' LUX211just3col.csv 

    This command looks for records that contain MI in field 3, and prints fields 1 and 2 when this is so. Using the tilde is more forgiving than using a double equal sign, which checks for an exact match.
    Use the tilde when you are not sure that the data will cooperate, or when you are unsure of how the record ends. How it what? Application programs like Microsoft Office programs are notorious for appending hidden characters to the end of a record, or a field, so your tests may need to be less precise than you might think.
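
To see why the tilde is more forgiving, the sketch below builds a three-line stand-in for the class file. The names are invented and the file name sample.csv is hypothetical. Each record ends with a carriage return, the kind of hidden character a Windows program often leaves behind. On a Linux system the == test should find nothing, while the ~ test still finds the Michigan records.

```shell
workdir=$(mktemp -d)
# Each record ends in \r\n, as a file saved on Windows often does.
printf 'Smith,Pat,MI\r\nJones,Lee,OH\r\nBrown,Kim,MI\r\n' > "$workdir/sample.csv"

# Exact comparison: field 3 is really "MI" plus a carriage return, so no match.
exact=$(awk -F, '$3 == "MI" { print $1, $2 }' "$workdir/sample.csv")

# Pattern match: field 3 only has to contain MI somewhere.
loose=$(awk -F, '$3 ~ "MI" { print $1, $2 }' "$workdir/sample.csv")

echo "exact match found: [$exact]"
echo "tilde match found: [$loose]"
```

The forgiveness cuts both ways: the tilde test would also select a field such as MIAMI, so it is worth looking at the output before trusting it.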

Pages 622 to 625 list several logical and mathematical operators we can use in gawk/awk patterns and programs. These may be helpful to you when writing programs. Two other concepts are discussed in this section: variables and functions.

Page 624 lists several predefined variables and their default values. The text calls the variables in this list program variables. They already exist for any program you write in this language. You are allowed to create and use your own variables as well. A list of eight functions also appears on this page and the next. They are very specific, so you may not use them all very frequently.
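
As a brief sketch of predefined variables at work, the command below uses NR (the current record number), NF (the number of fields in the current record), and OFS (the output field separator), along with the built-in functions length and toupper. These are standard in awk and gawk, but whether each one appears in the text's lists is an assumption on my part. The input lines are made up.

```shell
vars_out=$(printf 'red green\nblue\n' | awk '
BEGIN { OFS = " | " }                            # placed between printed items
      { print NR, NF, length($1), toupper($1) }  # record no., field count, etc.
')
echo "$vars_out"
```

Each output line reports the record number, how many fields that record has, the length of its first field, and the first field in capitals.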

You should read through the programs in the examples section of the chapter to become familiar with the syntax and usage of the commands. Do this on your own. For this lesson, let's look at one example that uses some of the features above, and contains a for loop we have not seen before.

Turn to page 629 first, to see the data file that the author uses for most of his examples. The file is called cars. It contains twelve records, each of which has five fields. You may want to create a copy of this file so you can practice the example programs. I have uploaded a comma delimited version of the file to the Week 7 folder. For our discussion, turn to page 642, and read the tally program.

This program has a BEGIN and an END section. BEGIN is preceded by the opening single quote that marks the beginning of the gawk program. END is followed by the closing single quote, and a redirection operator to take input from a new data file called numbers.

  1. The BEGIN section only sets a value for the ORS variable, the Output Record Separator. Looking at the expected output on page 642, this appears to be a newline character. I think I would have used \n to represent it.
  2. A new program section opens with NR == 1, which is a pattern that tells gawk to do the following only for record 1. What it does is take the number of fields in that record, and assign the value to the variable nfields.
  3. A new program section is opened at the bottom of page 642, which starts with an if conditional statement.

    if ($0 ~ /[^0-9. \t]/)

    In that cryptic conditional, we need to understand what is happening. $0 stands for the entire record currently being read. This will happen for every record. The tilde means "contains a match for". The material between the two forward slashes is interpreted as a regular expression. The expression is enclosed in square brackets, so it defines a set of characters. Outside of brackets, a caret (^) anchors a match to the beginning of the record, but as the first character inside brackets it means "not": the set is every character except the ones listed. The listed characters are the ones we allow in a record: the digits 0 through 9, decimal points, spaces, and tab characters. The condition is therefore true when the record contains at least one character that is not allowed, which is what we want, because this branch is meant to print out the bad records. No extra not operator is needed. To reverse a match test in gawk you would use the !~ operator; a ! placed inside the slashes is just a literal exclamation point.

  4. The opening if structure is followed by an else, which leads us to the loop I wanted you to see.

    for (count = 1; count <= nfields; count++)

    This is an elegant structure. Its purpose is to run a process for each field in a record. The first phrase, count = 1, initializes a variable that tracks which pass of the loop we are on. This phrase, no matter what it says, is executed one and only one time for each record: just before the first pass through the loop. The second phrase is the test condition: do we run the loop this time? It is checked before every pass, and it is true as long as count is less than or equal to the value of nfields, the variable that holds the number of fields in each record, so the loop runs once for each field. The third phrase is the update phrase, and it is executed at the end of each pass, each time a field has been processed. It increments the count variable by 1. When count finally exceeds nfields, the test fails and the loop ends.

    What happens in the body of the loop? The current field is printed to a file with the printf command. This command has lots of options and switches to control the formatting of your data. Read this article on the subject and you will see examples that explain that the number stored in the field is being printed with two decimal places, with at least two digits to the left of the decimal point.
    The loop also adds the value of the current field to an item in an array that will hold column totals, and to a running total of all items in the report.
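
The pieces discussed in steps 1 through 4 can be tried together in a much smaller program. This is a simplified sketch, not the tally program from the text: the variable names nfields, sum, and gtotal and the %8.2f format are choices made for this illustration, the sample numbers are invented, and output goes to the screen rather than to a file. Note that the caret is the first character inside the brackets, so the test is true for any record holding a character other than a digit, period, space, or tab.

```shell
tally_out=$(printf '1 2.5\n3 x\n4 0.5\n' | awk '
NR == 1 { nfields = NF }                    # remember the field count of record 1
{
    if ($0 ~ /[^0-9. \t]/) {                # any character outside the allowed set?
        print "Record " NR " skipped: " $0
        next                                # go on to the next record
    }
    for (count = 1; count <= nfields; count++) {
        printf "%8.2f", $count              # print one field, two decimal places
        sum[count] += $count                # add the field to its column total
        gtotal += $count                    # add the field to the grand total
    }
    printf "\n"
}
END {
    for (count = 1; count <= nfields; count++)
        printf "%8.2f", sum[count]          # one total per column
    printf "\nGrand total %.2f\n", gtotal
}
')
echo "$tally_out"
```

The second input record contains the letter x, so it should be reported as skipped and left out of every total.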

I will leave you to examine this program outside of class to decide if it has more errors in it.

Before we quit for this week, Doug gave several students access to a tool that the text has not mentioned. It is not in the index either, so let's use another source, as we often must, to learn about bc, the basic calculator. bc is a separate program rather than a part of bash, but we can call it from a bash command line or script. The essence of the lesson is that we can pipe a string containing a math problem to the bc program, which will return an answer. That returned answer can be captured in a variable.
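
A minimal sketch of that idea follows, assuming bc is installed on your system. The variable name answer and the problem itself are just examples.

```shell
# Pipe the problem to bc and capture what it prints in a variable.
# scale=2 asks bc to keep two digits after the decimal point.
answer=$(echo "scale=2; 10 / 4" | bc)
echo "10 / 4 = $answer"
```

Without the scale setting, bc does integer division and would return 2. That is also what bash's own $(( )) arithmetic gives, which is one reason to reach for bc when you need decimals.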