LUX 205 - Introduction to Linux/UNIX

Chapter 5, Advanced File Processing

Objectives:

This lesson discusses more operations you can perform on files from a command line in UNIX and Linux. Objectives important to this lesson:

  1. Using the pipe operator
  2. Using grep to search files
  3. Using uniq to remove duplicate lines
  4. Using comm and diff to compare files
  5. Using the wc command to count characters, words, and lines
  6. Using sed, tr, and pr to edit and format text
  7. Writing and using a shell script

Concepts:

This chapter sets a long list of objectives, but it is another short one, so take heart about that. The chapter begins with a review of some commands and a quick summary of others.
On pages 214 and 215, the text gives us two tables that organize several commands we know and several new ones into two groups:

Selection commands are commands that extract information from files.

  • comm - compares sorted files, showing lines unique to each and lines common to both
  • cut - selects columns of data
  • diff - compares files, selects differences
  • grep - searches for data, selects lines or rows where it is found
  • head - selects and shows lines from the beginning of a file
  • tail - selects and shows lines from the end of a file
  • uniq - selects unique lines (rows) in a file
  • wc - counts characters, words, or lines

Manipulation and transformation commands do something with or to data or files.

  • awk - starts an instance of awk, which allows you to use awk commands to manipulate data
  • cat - creates, combines, and/or displays files
  • chmod - changes a file's security mode, which means it is used to grant or remove rights to a file
  • join - assuming the files contain tables of data, this combines those tables into one table, like a join performed with a SELECT command in Oracle
  • paste - takes columnar data from two files and creates a table containing those columns in a new file
  • pr - prints selected files
  • sed - performs edits on data in files
  • sort - places data in a specified order
  • tr - tr means translate; this is used to substitute one character for another character, or to remove characters from a body of data
    (The text is not clear about why this is useful. Read this web page about the command for a better explanation of its use.)

The text explains that the pipe operator (|) is another redirection operator. The explanation on page 215 is correct. If you put a pipe on the command line between two commands, the first command will be executed and its output will be sent (piped) to the second command as input. The generic form is like this:
command | other_command
This would allow the first command to run normally, but take its output and hand it to the other command as input. The text offers an example. It proposes asking for a listing of a large directory, but piping the output through the more command to enable the user to view the results one full screen at a time. In the second example, on page 216, the text suggests asking for a large listing, piping to a sort command, then piping to more. We should not assume that all pipes lead to more. It is just an easy example.
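A minimal sketch of the idea, using printf to stand in for a command with long output, and head in place of more so the result fits here:

```shell
# Generate four lines, sort them, and view only the first two.
# Each command's output becomes the next command's input.
printf 'delta\nalpha\ncharlie\nbravo\n' | sort | head -2
# → alpha
# → bravo
```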

Regarding commands we have already discussed, the cut command can be viewed as a way of displaying only a part of the information found in some file. Its syntax reflects the fact that, like most UNIX commands, it looks at a file one line at a time. The command
cut -c8 filename
would return the 8th character from every line in the file called filename.

If we want to see a sequence of characters, we might use
cut -c5-15 filename
which would return the 5th through 15th characters from each line in the file.

If the file is organized (as we have seen) into fields, with all fields separated (or delimited) by the same character (like a colon, or a tab), we can tell cut to show us certain fields.
cut -d: -f2,4 filename
tells the cut command that the fields are separated (delimited) by colons, that we want the 2nd and 4th fields in the file, and that the file is called filename. We can leave out the -d option if the field separator is a tab, since cut assumes a tab separator when -d is not specified.
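A short demonstration of both forms, using a hypothetical colon-delimited file named people.txt:

```shell
# Two records, four colon-delimited fields each.
printf 'alice:x:1001:staff\nbob:x:1002:admin\n' > people.txt

cut -c1-5 people.txt       # characters 1-5 of each line: alice, bob:x
cut -d: -f1,4 people.txt   # fields 1 and 4: alice:staff, bob:admin

rm people.txt
```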

The grep command is discussed next. The text suggests three possible meanings for grep. The one I recall is their second choice: Global Regular Expression Parser. Global in the sense that it will search through a file, a list of files, or all the files in a folder. Parser in the sense that it looks through the parts of a file (characters and words). Regular Expression because that is what someone once called search strings that are allowed to include wildcards and other operators. Note the options on page 217. The default behavior is to return filenames and lines in those files that match the search string, but you can use the -l option to limit the return to just a list of filenames that contain hits.
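A quick sketch of both behaviors with a hypothetical log file (when only one file is searched, grep prints just the matching lines; with multiple files it prefixes each line with its filename):

```shell
printf 'error: disk full\nok: all good\nerror: no route\n' > app.log

grep 'error' app.log     # prints the two matching lines
grep -l 'error' app.log  # prints only the filename: app.log

rm app.log
```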

The uniq command has a very specific function. You feed it a file that consists of lines of text. It examines each line, and it returns each line, but only if the line it just returned does not match the current line. In this way, it returns one and only one copy of each combination of characters in a line found in that file, as long as the file was sorted alphabetically to start with. You may wonder, what the hell good is that?!? Well, it was written before the sort command had a -u switch, which does the same thing, but it sorts the file first then looks for "unique" lines. UNIX admins are not known for updating their systems, so the sort filename | uniq method will still work, even if the "newer" version of sort has not been installed. Note: this filter method is not meant to be used when editing text that depends on multiple instances of the same string.
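Both approaches side by side; note that uniq only collapses adjacent duplicates, which is why the input must be sorted first:

```shell
# The old two-step method:
printf 'pear\napple\npear\napple\n' | sort | uniq
# → apple
# → pear

# The newer one-step equivalent:
printf 'pear\napple\npear\napple\n' | sort -u
# → apple
# → pear
```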

The comm command compares two files, and it produces three columns of output:

  • lines that are only found in file 1
  • lines that are only found in file 2
  • lines that are found in both file 1 and file 2

If you want a mnemonic for the comm command, remember that the third column reports lines that are common to both files. Note that each column can be suppressed in the output by using -column_number as a switch.
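A sketch with two hypothetical sorted files (comm expects its inputs to be sorted):

```shell
printf 'apple\nbanana\ncherry\n' > fruit1.txt
printf 'banana\ncherry\ndate\n'  > fruit2.txt

comm fruit1.txt fruit2.txt      # three tab-indented columns
comm -12 fruit1.txt fruit2.txt  # suppress columns 1 and 2: only the common lines

rm fruit1.txt fruit2.txt
```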

The diff command examines two files that are supposed to be similar, and it gives us a report about the lines in each that are different. Take a look at the discussion on pages 221 and 222, then be glad we will not be wasting a lot of time on this command. It may be wonderful for instances that need it, but it is hard to imagine finding ourselves in such an instance.

The wc (word count) command is used to count three kinds of things about a text file. It can count the number of words (-w), the number of lines (-l), the number of bytes (-c), or any combination of those three options. Note that the switch for bytes is -c, which assumes one byte per character, as in the ASCII or extended ASCII sets.
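A quick check of all three counts on a small hypothetical file:

```shell
printf 'one two three\nfour five\n' > sample.txt

wc -l sample.txt   # 2 lines
wc -w sample.txt   # 5 words
wc -c sample.txt   # 24 bytes (22 visible characters plus 2 newlines)
wc sample.txt      # all three at once: lines, words, bytes

rm sample.txt
```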

The text explains that you may (will?) sometimes want to make global changes to huge files. In a case like that, you want an editing tool that is made to dig into large bodies of data. The tool the text describes is sed, whose name stands for stream editor. It helps to have already accepted the concept of streams, which can be treated as files. sed is happy to apply your instructions to each line in several files tirelessly, where a human (or other biological life form) might grow inattentive and error prone. The link I have just given you goes to a web page with much more information about sed than we are given in the current chapter of your text.

One method of using sed is from the command line, such as
sed -e "s/text_to_find/text_to_write/" filename

The -e option means to read the sed commands entered on the command line (as opposed to -f, which means to read them from a sed script file). The quoted text has three parts, delimited by slashes: the "s" means to substitute text, "text_to_find" is the text that sed will search for on each line, and "text_to_write" is the text that will replace it. Note that only the first match on each line is replaced unless you append a g (global) flag after the final slash. Finally, filename could be the name of a specific file to process, or a phrase that expands to many filenames.
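A concrete sketch, with a hypothetical file; note the difference the trailing g flag makes:

```shell
printf 'the cat sat on the mat\n' > story.txt

sed -e 's/the/a/' story.txt    # → a cat sat on the mat  (first match only)
sed -e 's/the/a/g' story.txt   # → a cat sat on a mat    (every match)

rm story.txt
```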

The command may also specify a line number or a range of line numbers to process in the named file(s), though this gets tricky when more than one file is involved.
sed -e "1,250s/text_to_find/text_to_write/" filename

This example means to process the sed substitute command on lines 1 through 250 in filename. If filename above expands to mean several files, then the first one in the list contains lines 1 through x. The second file contains line numbers x+1 through y, the third contains line numbers y+1 through z, and so on. For the purposes of sed, all lines in all files in the list are considered as though they were actually consecutive lines in a single file, with ever increasing line numbers. If we intend to process all lines in the first file, and half the lines in the second file, we had better use the wc command to count the lines first.
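A small demonstration of that continuous numbering, using two hypothetical three-line files (this is GNU sed's default behavior; its -s option restarts numbering for each file):

```shell
printf 'a\nb\nc\n' > f1.txt
printf 'd\ne\nf\n' > f2.txt

# Lines 1-3 are in f1.txt and lines 4-6 are in f2.txt, so this range
# marks the last line of the first file and the first line of the second.
sed -e '3,4s/^/X/' f1.txt f2.txt
# → a  b  Xc  Xd  e  f  (one item per line)

rm f1.txt f2.txt
```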

Using a sed script file is similar to the section above:
sed -f sedcommands filename
In this example, the -f option means that the next argument is a file (a text file) of sed commands that are to be processed on the expansion of filename.
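A sketch with hypothetical names for both the script file and the data file:

```shell
# Two substitution commands, one per line, stored in a sed script file.
printf 's/cat/dog/\ns/mat/rug/\n' > fixes.sed
printf 'the cat sat on the mat\n' > story.txt

sed -f fixes.sed story.txt   # → the dog sat on the rug

rm fixes.sed story.txt
```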

The sed command also has a delete option. Remember that, being a line editor, it deletes entire lines. With that in mind, you can use
sed -e "10,15d" filename
to delete lines 10 through 15 in filename.
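For example, with a hypothetical five-line file:

```shell
printf 'one\ntwo\nthree\nfour\nfive\n' > nums.txt

sed -e '2,4d' nums.txt   # → one  five  (lines 2 through 4 are gone)

rm nums.txt
```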

Your text explains that sed can append to a file, though it might also be thought of as an insertion. The append command takes the form of an address followed by a\, where the address is the number of the line to begin appending after. If you do not supply an address, sed appends the text after every line of the file; to append only at the end of the file, use the $ address.
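A sketch of both cases, using the portable a\ form (the appended text goes on the next line of the command):

```shell
printf 'alpha\nbeta\n' > list.txt

# Append after line 1:
sed -e '1a\
inserted' list.txt
# → alpha  inserted  beta  (one item per line)

# Append after the last line, using the $ address:
sed -e '$a\
inserted' list.txt
# → alpha  beta  inserted  (one item per line)

rm list.txt
```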

The text explains the tr (translate) command as being useful for translating sets of characters. It is hard to see what value this one has, so let's take a look at some examples from another source. It is clearer from the material on the web site that the two character strings you need to supply can be specified in different ways, but they usually must be the same length. There are some exceptions to that "rule" in the examples on the web page.
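A few common uses that show why tr earns its keep:

```shell
# Translate: map each lowercase letter to its uppercase partner.
printf 'hello\n' | tr 'a-z' 'A-Z'          # → HELLO

# Delete: -d takes a single set and removes those characters.
printf 'h3ll0 w0rld\n' | tr -d '0-9'       # → hll wrld

# Squeeze: -s collapses runs of a repeated character to one.
printf 'too    many spaces\n' | tr -s ' '  # → too many spaces
```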

The pr command prints to the standard output stream. Its format assumes you want paged output in standard 66-line pages. (That makes 6 lines per vertical inch on 11 inch sheets of paper.) In the 66 lines, the command assumes a 5 line header, which displays information about the file by default. There is also a 5 line footer, which the text calls a trailer. You can override the default settings with optional switches, as noted on page 225.
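A quick sketch with a hypothetical two-line file (the exact header text varies, but it includes the filename by default):

```shell
printf 'line one\nline two\n' > report.txt

pr report.txt | wc -l     # the full default page: 66 lines on a standard setup
pr -t report.txt | wc -l  # -t suppresses the header and trailer, leaving 2

rm report.txt
```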

On page 225, the text begins a new section that leads into a major project: designing an application that makes use of the functions of the operating system we have discussed. It is debatable whether the end product of the chapter's projects is really an application, since it produces no machine-language files. It does, however, give us a reason to discuss several points about application design that will help you in this project and in others that will follow in other classes.

  • Projects often start by specifying what the outputs or reports of the system must be. You do this to determine two more things: what data must we have to produce the reports, and what processing will be needed to turn that data into those reports.
  • Once you define your data needs, you need to determine how you will store that data. Does this require only one file of data? Is this a more complex system that needs data stored in separate tables, such as you would use in a relational database? If so, what are the key fields in the tables that will link one data table to another? Designing and troubleshooting your data tables may be a challenge if you have not had a database class that dealt with normalization of data. Luckily for us, the text furnishes a design for this project.
  • Instead of using a programming language that requires compiling its commands into machine language, we will be creating a shell script to execute the processes that are needed. The text advises us to write modules of code that can be tested separately. It will not be pretty if a long program fails when its parts have never been tested separately. You may have no idea what to troubleshoot. This concept is supported by having separate modules that can be completed in the chapter projects.
  • The text also recommends that we include remarks (comments, internal documentation) in our shell scripts. It is not something every programmer does, but it should be done to explain the functions of a program to those who support the program after you leave the project. Remarks are also very beneficial as markers for sections of the program, and as reminders to you of what you have done and what you are trying to do. They make a world of difference when you look at a program you have not seen for a while.

Browse the pages at the end of the chapter to get an idea of what this project is about. You may want to practice skills with the various commands in this chapter by doing some of projects 5-1 through 5-11.
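To make objective 7 concrete, here is a small sketch of a commented, testable shell script module in the spirit of the chapter project; the file names and field layout here are hypothetical:

```shell
# Create the script. It lists name and extension fields from a
# colon-delimited data file, sorted for readability.
cat > phlist.sh <<'EOF'
#!/bin/sh
# phlist.sh -- list names and extensions from a colon-delimited file.
# Usage: phlist.sh datafile
datafile="$1"
cut -d: -f1,2 "$datafile" | sort
EOF

chmod +x phlist.sh   # grant execute permission, as discussed with chmod

printf 'smith:2345\nadams:1190\n' > phones.txt
./phlist.sh phones.txt
# → adams:1190
# → smith:2345

rm phlist.sh phones.txt
```

Note how the script is a single module that can be run and tested on its own, with comments explaining its purpose, before being combined with other modules.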