This lesson discusses more operations you can perform on
files from a command line in UNIX and Linux. Objectives
important to this lesson:
Using the pipe operator
Using grep to search files
Using uniq to remove duplicate lines
Using comm and diff to compare files
Using the wc command to count characters, words, and lines
Using sed, tr, and pr
Concepts:
This chapter sets a long list of objectives, but it is another
short one, so take heart about that. The chapter begins with a
review of some commands and a quick summary of others.
On pages 214 and 215, the text gives us two tables that organize
several commands we know and several new ones into two groups:
Selection commands are commands that extract information from files.
comm - compares files, shows differences
cut - selects columns of data
diff - compares files, selects differences
grep - searches for data, selects lines or rows where it is found
head - selects and shows lines from the beginning of a file
tail - selects and shows lines from the end of a file
uniq - selects unique lines (rows) in a file
wc - counts characters, words, or lines
Manipulation and transformation commands do something with or to data or files.
awk - starts an instance of awk, which allows you to use awk commands to manipulate data (yes, it's a language)
cat - creates, combines, and/or displays files
chmod - changes a file's security mode, which means it is used to grant or remove rights to a file
join - assuming files contain tables of data, this combines those tables into one table, like a SELECT that joins two tables in Oracle
paste - takes columnar data from two files and creates a table containing those columns in a new file
pr - prints selected files
sed - performs edits on data in files
sort - places data in a specified order
tr - tr means translate; this is used to substitute one character for another character, or to remove characters from a body of data (The text is not clear about why this is useful. Read this web page about the command for a better explanation of its use.)
The text explains the pipe operator (|) again, telling us it is another redirection operator. The explanation on page 215 is correct. If you put a pipe on the command line between two commands, the first command will be executed and its output will be sent (piped) to the second command as input. The generic form is like this:
command | other_command
This allows the first command to run normally, but its output is handed to the other command as input. The text offers an example: it proposes asking for a listing of a large directory, but piping the output through the more command so the user can view the results one screen at a time. In the second example, on page 216, the text suggests asking for a large listing, piping it to a sort command, then piping that to more. We should not assume that all pipes lead to more. It is just an easy example.
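For instance (using /etc only as a stand-in for any large directory), the two examples might look like this:
ls -l /etc | more
ls /etc | sort -r | more
The first pipeline shows a long listing one screen at a time; the second sorts a short listing in reverse order before paging through it.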
Regarding other commands we have already discussed, the cut command can be viewed as a way of displaying only a part of the information found in some file. Its syntax reflects the fact that, like most UNIX commands, it looks at a file one line at a time. The command
cut -c8 filename
would return the 8th character from every line in the file called filename. This is not very interesting unless you know that a file was encrypted by a third grader.
If we want to see a sequence of characters, we might use
cut -c5-15 filename
which would return the 5th through 15th
characters from each line in the file.
If the file is organized (as we have seen) into fields,
with all fields separated (or delimited) by the
same character (like a colon, or a tab), we can tell cut
to show us certain fields. This gets more useful for data files.
Example:
cut -d: -f2,4 filename
tells the cut command that the fields are separated (delimited) by colons, that we want the 2nd and 4th fields in the file, and that the file is called filename.
We can leave out the -d option if the field separator is
a tab. The cut command will assume the separator
is a tab if it is not specified with the -d switch.
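For a concrete illustration, /etc/passwd is delimited by colons on most systems, so a command like this shows each account's username (field 1) and login shell (field 7):
cut -d: -f1,7 /etc/passwd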
It is more useful to look at the commands with more functions,
like sed and grep. The video shown below has a good introduction
to sed, with an example usage that the presenter has found worth
knowing.
The grep command is
discussed next. The text suggests three possible meanings for
grep. The one I recall is their second choice: Global
Regular Expression Parser. Global in the sense that it
will search through a file, a list of files, or all the files in
a folder. Parser in the sense that it looks through the parts of
a file (characters and words). Regular Expression because that
is what someone once called search strings that are allowed to
include wildcards and other operators. Note the options on page
217. The default behavior is to return filenames and lines in
those files that match the search string, but you can use the -l
option to limit the return to just a list of filenames that
contain hits.
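For example (the search string and filenames here are just placeholders), compare the default output with the -l option:
grep "error" *.log
grep -l "error" *.log
The first form shows every matching line, prefixed with the name of the file it came from; the second shows only the names of the files that contain at least one hit.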
To save a bit of time, the video below has an introduction to several commands. It's a long video, but you can skip ahead to the sections you currently care about.
1:25 grep
6:14 piping output into commands
9:36 sed and awk
17:42 more awk
30:21 less
35:46 find / exec
47:55 gzip, gunzip, tar
The uniq command has a very specific function. You feed it a file that consists of lines of text. It examines each line, and it returns that line only if it does not match the line it just returned. In this way, it returns one and only one copy of each distinct line found in that file, as long as duplicate lines are adjacent, which is why the file is usually sorted alphabetically to start with. You may wonder, what good is that?!? Well, it was written before the sort command had a -u switch, which does the same thing, but it sorts the file first, then looks for "unique" lines. UNIX admins are not known for updating their systems, so the
sort filename | uniq
method will still work, even if the "newer" version of sort has not been installed. Note: this filter method is not meant to be used when editing text that depends on multiple instances of the same string.
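As a minimal sketch (names.txt is a hypothetical file containing duplicate lines), either of these produces one copy of each distinct line:
sort names.txt | uniq
sort -u names.txt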
The comm command compares two files, and it
produces three columns of output:
lines that are only found in file 1
lines that are only found in file 2
lines that are found in both file 1 and file 2
If you want a mnemonic
for the comm command, remember that the third column reports
lines that are common
to both files. Note that each column can be suppressed in the
output by using -column_number
as a switch.
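For example (with two hypothetical files that have already been sorted, since comm expects sorted input):
comm list1.txt list2.txt
comm -12 list1.txt list2.txt
The first command shows all three columns; the second suppresses columns 1 and 2, leaving only the lines common to both files.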
The diff command
examines two files that are supposed to be similar, and it gives
us a report about the lines in each that are different. Take a
look at the discussion on pages 221 and 222, then be glad we
will not be wasting a lot of time on this command. It may be
wonderful for instances that need it, but it is hard to imagine
finding ourselves in such an instance.
The wc (word count) command is used to count three kinds of things about a text file. It can count the number of words (-w), the number of lines (-l), the number of bytes (-c), or any combination of those three options. Note that the switch for bytes is -c, which assumes one byte per character, as in the ASCII or extended ASCII sets.
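For example (report.txt is just a placeholder name):
wc report.txt
wc -l report.txt
With no switches, wc reports the line, word, and byte counts (in that order) followed by the filename; with -l it reports only the line count.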
The text explains that you may (will?) sometimes want to make
global changes to huge files. In a case like that, you want an
editing tool that is made to dig into large bodies of data. The
tool the text describes is sed, whose name may stand for stream
editor. It helps to have already accepted the concept of
streams, which can be treated as files. sed is happy to apply
your instructions to each line in several files tirelessly,
where a human (or other biological life form) might grow
inattentive and error prone. The link I have just given you goes
to a web
page with much more information about sed than we
are given in the current chapter in your text.
One method of using sed is from the command line, such as
sed -e "s/text_to_find/text_to_write/" filename
The -e option means to read the sed commands entered on the command line (as opposed to -f, which means to read them from a sed script). The quoted text has three parts, separated by slashes. The "s" means to substitute text. The "text_to_find" is the text that sed will search for on each line. The "text_to_write" is the text that will replace the "text_to_find" on each line. Finally, filename could be the name of a specific file to process, or a phrase that expands to many filenames.
The command may also specify a line number or a range of line numbers to process in the named file(s); however, this may not be a good way to do it, because it gets tricky.
sed -e "1,250s/text_to_find/text_to_write"
filename
This example means to process the sed substitute
command on lines 1 through 250 in filename. If filename
above expands to mean several files, then the first
one in the list contains lines 1 through x. The second
file contains line numbers x+1 through y, the third
contains line numbers y+1 through z, and so on. For the
purposes of sed,all lines in all files
in the list are considered as though they were actually consecutive
lines in a single file, with ever increasing line
numbers. If we intend to process all lines in the first file,
and half the lines in the second file, we had better use the wc
command to count the lines first.
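As a sketch (the filenames and line counts are assumptions for illustration): suppose wc -l reports that chap1.txt has 200 lines and chap2.txt has 100 lines. To substitute through all of chap1.txt and the first half of chap2.txt:
wc -l chap1.txt chap2.txt
sed -e "1,250s/text_to_find/text_to_write/" chap1.txt chap2.txt
Lines 1 through 200 come from chap1.txt, so lines 201 through 250 are the first 50 lines of chap2.txt.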
Using a sed script file is similar to the section above:
sed -f sed_commands filename
In this example, the -f option means that the next argument (sed_commands here) is a file (a text file) of sed commands that are to be processed on the expansion of filename.
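For example (the script and file names are placeholders), a script file named fixes.sed might hold a few substitutions, one per line:
s/teh/the/g
s/recieve/receive/g
It would then be applied with:
sed -f fixes.sed notes.txt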
The sed command also has a delete option. Remember
that, being a line editor, it deletes entire lines. With
that in mind, you can use
sed -e "10,15d" filename
to delete lines 10 through 15 in filename.
Your text explains that sed can append to a file, but it might also be thought of as an insertion. You use the a\ command, preceded by an address (a line number) that tells sed which line to append after. If you use $ as the address, sed appends after the last line of the file; if you give no address at all, the appended text follows every line of the input.
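A minimal sketch using GNU sed's one-line form (the text and filename are placeholders; strictly POSIX sed wants the appended text on the line after the backslash):
sed '3a\New text appended after line 3' notes.txt
sed '$a\New text appended after the last line' notes.txt
As with the substitutions above, the result goes to standard output unless you redirect it.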
The text explains the tr (translate) command as being
useful for translating sets of characters. It is hard to see
what value this one has, so let's take a look at some examples
from another source. It is clearer from the material
on the web site that the two character strings you need to
supply can be specified in different ways, but they usually
must be the same length. There are some exceptions to that
"rule" in the examples on the web page.
The pr command prints to the standard output stream.
Its format assumes you want paged output in standard 66-line
pages. (That makes 6 lines per vertical inch on 11 inch
sheets of paper.) In the 66 lines, the command assumes a 5 line
header, which displays information about the file by
default. There is also a 5 line footer, which the text
calls a trailer. You can override the default settings
with optional switches, as noted on page 225.
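For example (placeholder filename), -h replaces the default header text and -l changes the page length:
pr -h "Custom Header" notes.txt | more
pr -l 40 notes.txt | more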
Week 5 Assignments
Discussion 3 and Test 3 are due by 6pm on our class day next week.