Chapter 5, Advanced File Processing

ITS 2310 - Linux I

Objectives:

This lesson discusses more operations you can perform on files from a command line in UNIX and Linux. Objectives important to this lesson:

Using the pipe operator

Using grep to search files

Using uniq to remove duplicate lines

Using comm and diff to compare files

Using the wc command to count characters, words, and lines

sed, tr, and pr

Concepts:

This chapter sets a long list of objectives, but it is another short one, so take heart about that. The chapter begins with a review of some commands and a quick summary of others.
On pages 214 and 215, the text gives us two tables that organize several commands we know and several new ones into two groups:

Selection commands are commands that extract information from files.

comm - compares files, shows differences

cut - selects columns of data

diff - compares files, selects differences

grep - searches for data, selects lines or rows where it is found

head - selects and shows lines from the beginning of a file

tail - selects and shows lines from the end of a file

uniq - selects unique lines (rows) in a file

wc - counts characters, words, or lines

Manipulation and transformation commands do something with or to data or files.

awk - starts an instance of awk, which allows you to use awk commands to manipulate data (yes, it's a language)

cat - creates, combines, and/or displays files

chmod - changes a file's security mode, which means it is used to grant or remove rights to a file

join - assuming files contain tables of data, this combines those tables into one table by matching rows on a common field, much like a join in an SQL SELECT statement

paste - takes columnar data from two files and creates a table containing those columns in a new file

pr - formats selected files for printing (paged output with headers and trailers)

sed - performs edits on data in files

sort - places data in a specified order

tr - tr means translate; this is used to substitute one character for another character, or to remove characters from a body of data
(The text is not clear about why this is useful. Read this web page about the command for a better explanation of its use.)

The text explains the pipe operator (|) again, telling us it is another redirection operator. The explanation on page 215 is correct. If you put a pipe on the command line between two commands, the first command will be executed and its output will be sent (piped) to the second command as input. The generic form is like this:
command | other_command
This would allow the first command to run normally, but take its output and hand it to the other command as input. The text offers an example. It proposes asking for a listing of a large directory, but piping the output through the more command to enable the user to view the results one screen at a time. In the second example, on page 216, the text suggests asking for a large listing, piping to a sort command, then piping to more. We should not assume that all pipes lead to more. It is just an easy example.
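Here is a minimal sketch of a pipeline that does not end in more. The fruit names are made-up sample data, not an example from the text.

```shell
# Made-up sample data: three unsorted lines.
# printf writes them, sort orders them, head keeps the first two.
first_two=$(printf 'pear\napple\nfig\n' | sort | head -n 2)
echo "$first_two"
```

Each command reads what the previous one wrote, so no temporary files are needed along the way.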

Regarding other commands we have already discussed, the cut command can be viewed as a way of displaying only a part of the information found in some file. Its syntax reflects the fact that, like most UNIX commands, it looks at a file one line at a time. The command
cut -c8 filename
would return the 8th character from every line in the file called filename. This is not very interesting unless you know that a file was encrypted by a third grader.

If we want to see a sequence of characters, we might use
cut -c5-15 filename
which would return the 5th through 15th characters from each line in the file.

If the file is organized (as we have seen) into fields, with all fields separated (or delimited) by the same character (like a colon, or a tab), we can tell cut to show us certain fields. This gets more useful for data files. Example:
cut -d: -f2,4 filename
tells the cut command that the fields are separated (delimited) by colons, that we want the 2nd and 4th fields in the file, and that the file is called filename. We can leave out the -d option if the field separator is a tab. The cut command will assume the separator is a tab if it is not specified with the -d switch.
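The three forms of cut described above can be tried on a small sample. This is only a sketch: the file contents and the variable names are invented for illustration.

```shell
# Hypothetical colon-delimited sample data (not from the text).
data=$(mktemp)
printf 'alice:x:1001:Alice Smith\nbob:x:1002:Bob Jones\n' > "$data"

chars=$(cut -c1-3 "$data")               # characters 1 through 3 of each line
fields=$(cut -d: -f1,4 "$data")          # fields 1 and 4, colon-delimited
tabbed=$(printf 'a\tb\tc\n' | cut -f2)   # no -d given, so tab is the delimiter

echo "$chars"
echo "$fields"
echo "$tabbed"
rm -f "$data"
```

Note that when cut returns more than one field, it joins them with the same delimiter it was told to split on.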

It is more useful to look at the commands with more functions, like sed and grep. The video shown below has a good introduction to sed, with an example usage that the presenter has found worth knowing.

The grep command is discussed next. The text suggests three possible meanings for grep. The one I recall is their second choice: Global Regular Expression Parser. (The name is usually traced to the ed editor command g/re/p, which globally searches for a regular expression and prints the matching lines.) Global in the sense that it will search through a file, a list of files, or all the files in a folder. Parser in the sense that it looks through the parts of a file (characters and words). Regular Expression because that is what someone once called search strings that are allowed to include wildcards and other operators. Note the options on page 217. The default behavior is to return the lines that match the search string, each prefixed with its filename when more than one file is searched, but you can use the -l option to limit the return to just a list of filenames that contain hits.
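A short sketch of the default output and the -l option follows. The two files and their contents are invented for the example.

```shell
# Hypothetical sample files in a scratch directory.
dir=$(mktemp -d)
printf 'red apple\ngreen pear\n' > "$dir/a.txt"
printf 'yellow banana\n' > "$dir/b.txt"

# One file: grep prints just the matching lines.
one_file=$(cd "$dir" && grep 'apple' a.txt)
# Several files: each matching line is prefixed with its filename.
two_files=$(cd "$dir" && grep 'pear' a.txt b.txt)
# -l: print only the names of files that contain a match.
names=$(cd "$dir" && grep -l 'an' a.txt b.txt)

echo "$one_file"
echo "$two_files"
echo "$names"
rm -rf "$dir"
```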

To save a bit of time, the video below has an introduction to several commands. It's a long video, but you can skip ahead to the sections you currently care about.

1:25 grep
6:14 piping output into commands
9:36 sed and awk
17:42 more awk
30:21 less
35:46 find / exec
47:55 gzip, gunzip, tar

The uniq command has a very specific function. You feed it a file that consists of lines of text. It examines each line and returns it, but only if it does not match the line just before it. In this way, it returns one and only one copy of each distinct line found in that file, as long as the file was sorted to start with, so that duplicate lines sit next to each other. You may wonder, what good is that?!? Well, it was written before the sort command had a -u switch, which does the same thing, but it sorts the file first then looks for "unique" lines. UNIX admins are not known for updating their systems, so the sort filename | uniq method will still work, even if the "newer" version of sort has not been installed. Note: this filter method is not meant to be used when editing text that depends on multiple instances of the same string.
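To see why the sorting matters, here is a small sketch that runs uniq on unsorted data, then compares the two methods mentioned above. The list of fruit is made up.

```shell
# Hypothetical sample: duplicates exist, but none are adjacent.
list=$(mktemp)
printf 'pear\napple\npear\napple\nfig\n' > "$list"

unsorted=$(uniq "$list")          # nothing is removed: no duplicate follows its twin
via_uniq=$(sort "$list" | uniq)   # sort first, then drop adjacent repeats
via_sort=$(sort -u "$list")       # the same result from sort alone

echo "$via_uniq"
rm -f "$list"
```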

The comm command compares two files, and it produces three columns of output:

lines that are only found in file 1

lines that are only found in file 2

lines that are found in both file 1 and file 2

If you want a mnemonic for the comm command, remember that the third column reports lines that are common to both files. Note that comm expects both files to be sorted, and that each column can be suppressed in the output by using -column_number as a switch.
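The column switches are easier to remember after trying them. In this sketch the two sorted files are invented samples; suppressing two columns leaves only the third.

```shell
# Two hypothetical files, already sorted.
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/one"
printf 'banana\ncherry\ndate\n' > "$dir/two"

both=$(comm -12 "$dir/one" "$dir/two")       # hide columns 1 and 2: lines in both files
only_one=$(comm -23 "$dir/one" "$dir/two")   # hide columns 2 and 3: lines only in file 1
only_two=$(comm -13 "$dir/one" "$dir/two")   # hide columns 1 and 3: lines only in file 2

echo "$both"
rm -rf "$dir"
```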

The diff command examines two files that are supposed to be similar, and it gives us a report about the lines in each that are different. Take a look at the discussion on pages 221 and 222, then be glad we will not be wasting a lot of time on this command. It may be wonderful for instances that need it, but it is hard to imagine finding ourselves in such an instance.
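For the curious, this sketch shows what a diff report looks like for two invented three-line files that differ only on line 2. The "|| true" is there because diff returns a nonzero exit status when it finds differences.

```shell
# Two hypothetical files that differ on the second line.
dir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$dir/a"
printf 'one\n2\nthree\n' > "$dir/b"

changes=$(diff "$dir/a" "$dir/b" || true)
echo "$changes"
rm -rf "$dir"
```

The report begins with 2c2, meaning line 2 of the first file must be changed to match line 2 of the second. Lines starting with < come from the first file, and lines starting with > come from the second.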

The wc, word count, command is used to count three kinds of things about a text file. It can count the number of words (-w), the number of lines (-l), the number of bytes (-c), or any combination of those three options. Note that the switch for bytes is -c; the byte count equals the character count only when each character takes one byte, as in the ASCII or extended ASCII sets.
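A quick sketch of the three counts, using an invented two-line file. Reading the file through < keeps the filename out of the output, so only the number is captured.

```shell
# Hypothetical sample: 2 lines, 3 words, 14 bytes (newlines included).
f=$(mktemp)
printf 'one two\nthree\n' > "$f"

lines=$(wc -l < "$f")
words=$(wc -w < "$f")
bytes=$(wc -c < "$f")

echo "lines: $lines  words: $words  bytes: $bytes"
rm -f "$f"
```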

The text explains that you may (will?) sometimes want to make global changes to huge files. In a case like that, you want an editing tool that is made to dig into large bodies of data. The tool the text describes is sed, whose name may stand for stream editor. It helps to have already accepted the concept of streams, which can be treated as files. sed is happy to apply your instructions to each line in several files tirelessly, where a human (or other biological life form) might grow inattentive and error prone. The link I have just given you goes to a web page with much more information about sed than we are given in the current chapter in your text.

One method of using sed is from the command line, such as
sed -e "s/text_to_find/text_to_write/" filename

The -e option means to read the sed commands entered on the command line (as opposed to -f, which means to read them from a sed script file). The quoted text has three parts, separated by slashes. The "s" means to substitute text. The "text_to_find" is the text that sed will search for on each line. The "text_to_write" is the text that will replace the first "text_to_find" on each line; add a g after the last slash to replace every occurrence on the line. Finally, filename could be the name of a specific file to process, or a phrase that expands to many filenames.
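Here is a minimal sketch of the substitute command, with and without the g flag. The sample lines are made up. Note that sed writes its result to standard output and leaves the file itself alone.

```shell
# Hypothetical sample file with a repeated word on the first line.
f=$(mktemp)
printf 'cat and cat\ndog\n' > "$f"

first_only=$(sed -e 's/cat/bird/' "$f")    # replaces the first match on each line
every_one=$(sed -e 's/cat/bird/g' "$f")    # g flag: replaces every match on each line

echo "$first_only"
echo "$every_one"
rm -f "$f"
```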

The command may also specify a line number or a range of line numbers to process in the named file(s). This may not be a good way to do it, however, because it gets tricky when more than one file is named.
sed -e "1,250s/text_to_find/text_to_write/" filename

This example means to process the sed substitute command on lines 1 through 250 in filename. If filename above expands to mean several files, then the first one in the list contains lines 1 through x. The second file contains line numbers x+1 through y, the third contains line numbers y+1 through z, and so on. For the purposes of sed, all lines in all files in the list are considered as though they were actually consecutive lines in a single file, with ever increasing line numbers. If we intend to process all lines in the first file, and half the lines in the second file, we had better use the wc command to count the lines first.
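The running line count across files can be seen in a small sketch. The two files here are invented, each holding two identical lines, and the range 1,3 reaches one line into the second file.

```shell
# Two hypothetical two-line files.
dir=$(mktemp -d)
printf 'old\nold\n' > "$dir/first"
printf 'old\nold\n' > "$dir/second"

# Count the lines first, as suggested above.
total=$(cat "$dir/first" "$dir/second" | wc -l)

# Lines 1-2 are in the first file; line 3 is the first line of the second file.
result=$(sed -e '1,3s/old/new/' "$dir/first" "$dir/second")

echo "$result"
rm -rf "$dir"
```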

Using a sed script file is similar to the section above:
sed -f sed_commands filename
In this example, the -f option means that the next argument is a file (a text file) of sed commands that are to be processed on the expansion of filename.
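A sketch of the script-file method follows. The script holds one sed command per line; the script contents and the data are invented for the example.

```shell
# Hypothetical script file with two commands, and a three-line data file.
script=$(mktemp)
data=$(mktemp)
printf 's/red/blue/\n2d\n' > "$script"
printf 'red car\nred bus\nred bike\n' > "$data"

# Every command in the script is applied, in order, to each input line.
result=$(sed -f "$script" "$data")

echo "$result"
rm -f "$script" "$data"
```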

The sed command also has a delete option. Remember that, being a line editor, it deletes entire lines. With that in mind, you can use
sed -e "10,15d" filename
to delete lines 10 through 15 in filename.
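The same idea on a smaller scale, as a sketch: six numbered lines go in, and lines 2 through 4 are dropped from the output.

```shell
# Feed six lines through sed and delete lines 2 through 4.
kept=$(printf '%s\n' 1 2 3 4 5 6 | sed -e '2,4d')
echo "$kept"
```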

Your text explains that sed can append to a file, but it might also be thought of as an insertion. The append command is a\, followed on the next line by the text to add. A line number (or $ for the last line) placed in front of the command picks the line to append after. If you do not supply an address, sed appends the text after every line of input, so to add text only at the end of the file, use $ as the address.
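Here is a hedged sketch of appending with sed, written in the form that common sed implementations accept: the address comes first, then a\, then the new text on the following line. The sample lines are made up.

```shell
# Append a line after line 2 of the input.
after_two=$(printf 'one\ntwo\nthree\n' | sed -e '2a\
inserted after line 2')

# Use $ as the address to append after the last line.
at_end=$(printf 'one\ntwo\n' | sed -e '$a\
added at the end')

echo "$after_two"
echo "$at_end"
```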

The text explains the tr (translate) command as being useful for translating sets of characters. It is hard to see what value this one has, so let's take a look at some examples from another source. It is clearer from the material on the web site that the two character strings you need to supply can be specified in different ways, but they usually must be the same length. There are some exceptions to that "rule" in the examples on the web page.
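A few common uses may make the value of tr clearer. This is a sketch with invented input strings; tr reads standard input, so the data is piped in.

```shell
# Translate lowercase letters to uppercase (two sets of equal length).
upper=$(printf 'hello world\n' | tr 'a-z' 'A-Z')

# -d deletes every character in the set; only one set is needed.
no_digits=$(printf 'a1b2c3\n' | tr -d '0-9')

# -s squeezes runs of a repeated character down to a single one.
squeezed=$(printf 'a   b    c\n' | tr -s ' ')

echo "$upper"
echo "$no_digits"
echo "$squeezed"
```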

The pr command prints to the standard output stream. Its format assumes you want paged output in standard 66-line pages. (That makes 6 lines per vertical inch on 11-inch sheets of paper.) In the 66 lines, the command assumes a 5-line header, which displays information about the file by default. There is also a 5-line footer, which the text calls a trailer. You can override the default settings with optional switches, as noted on page 225.
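The page layout can be checked with a short sketch. It assumes a pr that pads a short page out to its full length, as the common implementations do, and it uses the -t switch, which omits the header and trailer. The two-line input is made up.

```shell
# A two-line input still fills one whole default page.
page_lines=$(printf 'a\nb\n' | pr | wc -l)

# -t drops the header and trailer, leaving only the text itself.
bare=$(printf 'a\nb\n' | pr -t)

echo "lines on the page: $page_lines"
echo "$bare"
```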