LUX 211 - Shell Programming

Lesson 9: Programming Concepts

Objectives:

This lesson ends the discussion of shell programming and the use of AWK. Objectives important to this lesson:

  1. Planning a program
  2. Math: in the shell or in awk?
  3. Variables to store and to update

What else can we look into before the class is over?

Once upon a time, Professor Scott, who some of you knew at Baker, remarked that he thought that people came into programming classes in one of two ways. They either already thought like programmers, or they did not, and it was a rare student who learned the skill in the class. We will try to add some aspect of thinking like a programmer tonight for those of you who do not quite have it.

When you write a computer program, it is a lot like giving directions to someone who has to be told where to go, what to look for, what to pick up, what to drop, and what else to do along the way. Sounds like a game, doesn't it? For some of us, it is. If the adventure game simile doesn't attract you, hang on for a bit and we will get to something for you as well.

Let's consider the program assignment from last week. I gave you a data file that needed to be analyzed. The instructions said, in part:

"Write an awk script that will parse the data, and for each county, print the county name, the population per square mile of land, and the percentage of the county that is water. (Note that the total area of the county is the sum of the land and water areas.)"

It also says:

"print the county name and the value for the following criteria: Highest population density, Lowest population density, Highest percentage of water, Lowest percentage of water. For example, you might print 'Highest population density: Adams County, 9999 people/square mile'."

How do we start to do that? We already know that awk (or gawk) needs to read the data file. When in doubt, you should always ask three basic questions:

  • What are the outputs for this program?
  • What are the inputs I can work with?
  • What must my program do to produce those outputs?

What are the outputs for the program? There is more, but the paragraph says we need to output the county name, population per square mile of land, and percentage of the county that is water.

What is in the data file? Any of that? And do we need to worry about the four special lines the assignment wants? Yes, but not yet. Let's do the basic stuff first, then worry about the harder parts. Let's examine the data file.

The census.txt file contains 39 lines of data. Each line has four fields. The delimiter between fields is a tab. Field 1 holds the county's name. Field 2 holds the population of the county. Field 3 holds the number of square miles of water in the county. Field 4 holds the number of square miles of land in the county.
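
Before writing any awk, it can help to confirm those facts about the file for yourself. The commands below are a quick sketch that uses a small invented file, sample.txt, as a stand-in for census.txt; the county names and figures in it are made up for illustration.

```shell
# Invented stand-in for census.txt: name, population, water sq. mi., land sq. mi.
printf 'Adams County\t1000\t10\t40\n'       > sample.txt
printf 'Grand Lake County\t500\t50\t50\n'  >> sample.txt

# How many lines of data are there?
wc -l < sample.txt

# How many tab-separated fields does each line have? A single distinct
# value in the output means every line has the same number of fields.
awk -F'\t' '{ print NF }' sample.txt | sort -u
```

Run against the real file, the first command should report the line count and the second should print a single 4 if every line is well formed.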

So for the first set of requirements, we have one field holding the data we want, and we must do some math for the other two columns. How do we get awk to do math? We will come back to that. If we only had to print out the county names, what would you do? Do we need a pattern, a program, or both? We want to process all lines, so a pattern to select lines is not needed. We need a program to tell awk what to do with each line. If we only wanted the county names, our program would be quite short: ' {print $1} '.
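
As a minimal sketch of that one-field program, the commands below build a three-line sample file and print only the county names. The file name sample.txt and the figures in it are invented for illustration, not taken from census.txt. The -F option tells awk that the fields are separated by tabs.

```shell
# Build a small tab-delimited sample file (invented figures, not real census data).
# Fields: county name, population, square miles of water, square miles of land.
printf 'Adams County\t1000\t10\t40\n'       > sample.txt
printf 'Grand Lake County\t500\t50\t50\n'  >> sample.txt
printf 'Dry County\t90\t0\t30\n'           >> sample.txt

# No pattern, so the action runs on every line; -F'\t' sets the tab delimiter.
awk -F'\t' '{ print $1 }' sample.txt
```

The delimiter matters here. Without -F, awk splits on any white space, so $1 would hold only "Adams" instead of "Adams County".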

Since we want more than that, we need to consider how to do it. The second element in the report is population per square mile of land. Population is in Field 2. Square miles of land is in Field 4. In a story problem, "per" means to divide. So the next element can be obtained by Field 2 divided by Field 4. In awk, it is a good idea to calculate math problems first, then do the print step, all inside the program. Let's amend our program:

' { popdensity = $2 / $4 ; print $1, popdensity } '

Note that the math operation in the awk program is written differently from the way you would write the same problem in bash. Bash does not allow spaces around the = in a variable assignment, and its built-in arithmetic handles only whole numbers. Awk lets you put spaces around operators and operands freely, and its division keeps the fractional part. Bash does want spaces around operators when you write a conditional statement. Remember where you are when writing a shell script. If you have invoked awk, use its syntax until you leave it.
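
A short side-by-side comparison may make the difference easier to see. This is only a sketch with made-up numbers; the variable names pop and land are hypothetical.

```shell
# Bash: no spaces around = in an assignment, and $(( )) does whole-number math.
pop=7
land=2
echo $((pop/land))              # prints 3

# awk: the same division, written with spaces, keeps the fraction.
awk 'BEGIN { print 7 / 2 }'     # prints 3.5

# A bash test, by contrast, needs spaces around the brackets and the operator.
if [ "$pop" -gt "$land" ]; then echo "bigger"; fi
```

Since our report needs densities and percentages with fractional parts, this is a good reason to do the math inside awk.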

It should be obvious that we need to add another math operation to compute the third column of output, and we need to amend the print command to print the outcome of that operation.

' { popdensity = $2 / $4 ; percentwater = 100 * $3 / ( $3 + $4 ) ; print $1, popdensity, percentwater } '

If we only cared about these items, we could call awk at the start of this command line, feed it the field delimiter, and show awk the data file at the end of the command line. We are only halfway home, however. We still need to look into those four special lines. We could write a separate awk command to handle them, or we can amend the one we are working on. Let's use the one above, since it is already doing some of the work we need it to do.
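
Put together, the command line described here might look like the sketch below. The sample file and its figures are invented for illustration. The printf formatting and the factor of 100, which turns the water fraction into a percentage, are additions for readability and are not part of the assignment text.

```shell
# Invented sample data: name, population, sq. mi. of water, sq. mi. of land.
printf 'Adams County\t1000\t10\t40\n'       > sample.txt
printf 'Grand Lake County\t500\t50\t50\n'  >> sample.txt

# awk first, then the delimiter, then the program, and the data file last.
awk -F'\t' '{
    popdensity   = $2 / $4                  # people per square mile of land
    percentwater = 100 * $3 / ( $3 + $4 )   # water as a percentage of total area
    printf "%s\t%.1f\t%.1f\n", $1, popdensity, percentwater
}' sample.txt
```

With this sample data, Adams County comes out at 25.0 people per square mile and 20.0 percent water, and Grand Lake County at 10.0 and 50.0.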

We care about four more things, and the counties that are associated with them: highest population density, lowest population density, highest percentage of water, and lowest percentage of water. We can introduce eight variables for this. I will call them highpopd, highpopdcounty, lowpopd, lowpopdcounty, highwater, highwatercounty, lowwater, and lowwatercounty.

These eight variables give us a good reason to use the BEGIN and END sections our awk program is allowed to have. In the BEGIN section, we can initialize the values of the variables. The four county variables can be blank, because they will change anyway. The two high variables should start lower than any value the data can produce; a negative number works, because a density or a percentage is never negative. The two low variables should start higher than any value the data can produce, but not higher than the variable can hold. In the main section of the program, we replace a high or low value, and its county name, whenever the current line's value is higher or lower than what is stored. This must take place after the density and percentage calculations have been made for the current line. In the END section we can print the values of the variables, which should now hold the actual highest and lowest calculated values.
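
One way the three sections could fit together is sketched below. The starting values (-1 for the high variables, 1000000 and 101 for the low ones) are assumptions chosen to sit outside anything the data could produce. The sample file is invented, the shell variable report_program is a hypothetical name used only to keep the awk program readable, and the percentage is scaled by 100. Adjust all of these to suit the real census.txt.

```shell
# Invented sample data: name, population, sq. mi. of water, sq. mi. of land.
printf 'Adams County\t1000\t10\t40\n'       > sample.txt
printf 'Grand Lake County\t500\t50\t50\n'  >> sample.txt
printf 'Dry County\t90\t0\t30\n'           >> sample.txt

# The awk program, held in a shell variable (hypothetical name).
report_program='
BEGIN {
    # High trackers start below any possible value, low trackers above.
    highpopd  = -1 ; lowpopd  = 1000000
    highwater = -1 ; lowwater = 101
    # The four county variables are left unset, which awk treats as blank.
}
{
    popdensity   = $2 / $4
    percentwater = 100 * $3 / ( $3 + $4 )
    print $1, popdensity, percentwater

    # Compare only after the calculations for this line are done.
    if ( popdensity   > highpopd  ) { highpopd  = popdensity   ; highpopdcounty  = $1 }
    if ( popdensity   < lowpopd   ) { lowpopd   = popdensity   ; lowpopdcounty   = $1 }
    if ( percentwater > highwater ) { highwater = percentwater ; highwatercounty = $1 }
    if ( percentwater < lowwater  ) { lowwater  = percentwater ; lowwatercounty  = $1 }
}
END {
    print "Highest population density: " highpopdcounty ", " highpopd " people/square mile"
    print "Lowest population density: "  lowpopdcounty  ", " lowpopd  " people/square mile"
    print "Highest percentage of water: " highwatercounty ", " highwater "%"
    print "Lowest percentage of water: "  lowwatercounty  ", " lowwater  "%"
}'

awk -F'\t' "$report_program" sample.txt
```

With this sample data, the END section reports Adams County for the highest density, Dry County for the lowest density and the lowest percentage of water, and Grand Lake County for the highest percentage of water. If two counties tie, this version keeps the first one it reads, because the comparisons use > and < and not >= and <=.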

This brings us to a question whose answer should occur to most of you. Can't we just read the data file and feed our variables the correct county lines? The answer is no. We cannot do that, because it would not be correct when the data changes, as it will when we have a new table from a new census or a new state. If we rely on a programmer to select correct answers from a body of data, we are not running a program, we are running a scam. And it is unlikely that the programmer would be able to answer the questions as quickly as a real program if there were hundreds of pages of data.