Teach Yourself UNIX Shell Programming in 14 Days

Chapter 5: Life with Regular Expressions

Objectives:

This chapter discusses the use of regular expressions with the grep command. Regular expressions are patterns that you use to search text files. Objectives important to this chapter are:

  • Use of grep
  • Pattern matching operators
  • New uses for metacharacters in grep

Concepts:

Grep stands for global regular expression print, after the g/re/p command of the ed editor. (Aren't you glad you know that now?) It exists to search text files for regular expressions, or patterns, and to print, or report, the lines where it finds them.

Grep uses pattern matching operators, some of which we have already seen used in other ways. This means that the way grep uses some operators is not the way other commands use them. Some pattern matching operators are:

  • the caret - beginning of line marker
  • the dollar sign - end of line marker
  • the period - a wild card operator, matches exactly one character (any character)
  • the asterisk - matches zero or more repetitions of the item just before it
  • curly braces - required number of repetitions (closures)
  • square brackets - accepted sets of values

The caret is used in a text search string (regular expression, right?) to show the start of a line. The command

	grep "public"

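would search for lines of text that contain the string "public" anywhere, even inside a longer word such as "republic". (With no filename given, grep reads its standard input.) However, the command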

	grep "^public" 

searches for lines that begin with "public", that is, lines on which "public" is the first thing on the line.

The dollar sign is used by grep to mark the end of line in regular expressions. The command

	grep "public$"

would find lines of text that end with "public", with nothing after it on the line.

Combining the two symbols, this command

	grep "^public$"

would search for lines that contain "public" and nothing else.
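The difference between the anchored and unanchored searches is easiest to see on a small file. The following is only a sketch: the file sample.txt and its four lines are made up for the illustration.

```shell
# Build a small sample file (hypothetical contents) in a temporary directory.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
printf '%s\n' \
    'public class Foo' \
    'the class is public' \
    'public' \
    'a republic of letters' > "$sample"

# No anchors: every line containing "public", including "republic".
anywhere=$(grep "public" "$sample")

# Caret: only lines that begin with "public".
at_start=$(grep "^public" "$sample")

# Dollar sign: only lines that end with "public".
at_end=$(grep "public$" "$sample")

# Both: only lines that are exactly "public".
whole_line=$(grep "^public$" "$sample")

printf 'anywhere:\n%s\n\nat start:\n%s\n\nat end:\n%s\n\nwhole line:\n%s\n' \
    "$anywhere" "$at_start" "$at_end" "$whole_line"

rm -rf "$tmpdir"
```

Note that the unanchored search reports all four lines, because "republic" contains "public".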

The basic syntax of the grep command is in the form

	grep pattern filename

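This syntax may be varied either by supplying a list of filenames or by using expansion metacharacters, like the asterisk, to specify all files in a group (for example, grep "stuff" *.txt). When more than one file is searched, grep prefixes each line it prints with the name of the file it came from.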
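As a rough illustration of searching a group of files, the sketch below creates three hypothetical files (a.txt, b.txt and c.log) and searches only the ones ending in .txt. The shell expands *.txt before grep ever runs, so grep simply receives a list of filenames.

```shell
# Create three hypothetical files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'some stuff here\nnothing to see\n' > "$tmpdir/a.txt"
printf 'more stuff\n' > "$tmpdir/b.txt"
printf 'stuff in a log\n' > "$tmpdir/c.log"

# Run the search from inside the directory (in a subshell, so the
# current directory of this script is left alone).
matches=$(cd "$tmpdir" && grep "stuff" *.txt)

# Each matching line is prefixed with the name of its file.
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

The file c.log also contains "stuff", but it is never searched because its name does not match *.txt.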

Your author groups the next two operators together as "parentheses" operators. They are more properly referred to as the curly brace and the square bracket. (In case you are wondering, neither is a parenthesis.)

Square brackets can be used, as we have seen before, to enclose a set of acceptable values. For instance, this command

	grep "[Bb]aker" schools.txt

would find all lines that contained the word "Baker" or the word "baker" in the file schools.txt. Ranges of values may also be specified inside the brackets, such as [0-9] for any digit or [a-z] for any lowercase letter.
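Here is a small sketch of both uses of square brackets, a set of letters and a range. The contents of schools.txt are invented for the example.

```shell
# Hypothetical contents for schools.txt.
tmpdir=$(mktemp -d)
schools="$tmpdir/schools.txt"
printf '%s\n' \
    'Baker Elementary' \
    'the baker street school' \
    'Room 101' \
    'Lakeview High' > "$schools"

# A set: the first letter may be either "B" or "b".
either_case=$(grep "[Bb]aker" "$schools")

# A range: any line containing at least one digit.
with_digit=$(grep "[0-9]" "$schools")

printf 'set:\n%s\n\nrange:\n%s\n' "$either_case" "$with_digit"

rm -rf "$tmpdir"
```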

Curly braces specify how many times in a row the item just before them must be repeated for the line to match. This is known as a closure. (Not all versions of UNIX support this operator.) When used with grep, each curly brace should be preceded by a backslash. The command

	grep "[0-9]\{5\}" address.txt

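will search the address.txt file for all lines containing at least five digits in a row (like zip codes, for instance). Because the pattern is not anchored, a line holding a longer number also matches, since its first five digits satisfy the pattern. If we want a run of five to seven digits, we could use the command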

	grep "[0-9]\{5,7\}" address.txt

where five is the least number of digits (repetitions) and seven is the greatest number of digits that the pattern itself will match. A line with a longer run of digits still contains a match, so it is printed too.
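The sketch below tries the closure patterns on a made-up address.txt. It also shows one possible way to insist on exactly five digits: require that the run be bordered by a non-digit or by an edge of the line. That last pattern uses grep -E (extended regular expressions), where the braces take no backslashes; it is offered as an assumption about what "five-digit number" should mean, not as the only way to do it.

```shell
# Hypothetical contents for address.txt.
tmpdir=$(mktemp -d)
addresses="$tmpdir/address.txt"
printf '%s\n' \
    'Detroit MI 48201' \
    'Suite 1234' \
    'Account 123456789' > "$addresses"

# Five digits in a row: also matches the nine-digit account number.
five_or_more=$(grep "[0-9]\{5\}" "$addresses")

# Five to seven digits in a row: the nine-digit number still contains one.
five_to_seven=$(grep "[0-9]\{5,7\}" "$addresses")

# Exactly five: the digits must be bordered by a non-digit or a line edge.
exactly_five=$(grep -E '(^|[^0-9])[0-9]{5}([^0-9]|$)' "$addresses")

printf 'five or more:\n%s\n\nexactly five:\n%s\n' "$five_or_more" "$exactly_five"

rm -rf "$tmpdir"
```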

Other closure expressions are the plus, the asterisk and the question mark. The plus sign after a set of values means match on at least one appearance of one of the values. The asterisk after a set means to match on zero or more appearances of the pattern specified. The question mark means to match on zero or one instance of the search pattern. The asterisk works in plain grep; the plus and the question mark belong to the extended regular expressions understood by egrep (or grep -E).
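A short sketch of the three closure operators, run against a made-up word list. The plus and question mark examples use grep -E, the extended regular expression form of grep; the asterisk example uses plain grep.

```shell
# Hypothetical word list.
tmpdir=$(mktemp -d)
words="$tmpdir/words.txt"
printf '%s\n' 'color' 'colour' 'colouur' 'colr' > "$words"

# Question mark: zero or one "u".
zero_or_one=$(grep -E '^colou?r$' "$words")

# Plus: one or more "u".
one_or_more=$(grep -E '^colou+r$' "$words")

# Asterisk: zero or more "u" (works in plain grep).
zero_or_more=$(grep '^colou*r$' "$words")

printf '?:\n%s\n\n+:\n%s\n\n*:\n%s\n' "$zero_or_one" "$one_or_more" "$zero_or_more"

rm -rf "$tmpdir"
```

The entry "colr" matches none of the three, because only the "u" is optional or repeated; the second "o" is still required.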

The grep command has several useful options. Each is preceded by a hyphen:

  • i - means to ignore case
  • n - means to print line numbers
  • v - print only lines that do not match the pattern
  • c - print only a count of the lines that matched the pattern
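The four options can be compared side by side on one small file. This is a sketch only; notes.txt and its three lines are invented for the purpose.

```shell
# Hypothetical contents for notes.txt.
tmpdir=$(mktemp -d)
notes="$tmpdir/notes.txt"
printf '%s\n' 'Public library' 'private garden' 'public park' > "$notes"

# -i: ignore case, so "Public" and "public" both match.
ignore_case=$(grep -i "public" "$notes")

# -n: prefix each matching line with its line number.
numbered=$(grep -n "public" "$notes")

# -v: print the lines that do NOT match.
inverted=$(grep -v "public" "$notes")

# -c: print only the number of matching lines.
count=$(grep -c "public" "$notes")

printf -- '-i:\n%s\n\n-n:\n%s\n\n-v:\n%s\n\n-c:\n%s\n' \
    "$ignore_case" "$numbered" "$inverted" "$count"

rm -rf "$tmpdir"
```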

 

As a variation on the -v option, you can tell grep not to print any blank lines in a file. The command

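	grep -v "^[ 	]*$" filename

will do this, because the brackets enclose a space and a tab, the star says match on zero or more of them, and the caret and dollar sign require that nothing else appear on the line. This treats lines holding only spaces and tabs as blank lines, along with lines holding nothing at all. Without the dollar sign, the pattern would match every line, and -v would print nothing. (My files of notes would not qualify, would they?)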

Empty lines are lines that contain only a line return character. To ignore them, we can use the command

	grep -v "^$" filename
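To see the difference between blank lines and empty lines, here is a sketch using a made-up four-line file whose second line is empty and whose third line holds only a space, a tab and a space. Because a literal tab is hard to see on the page, the example stores one in a variable named tab, which is a convenience of this sketch.

```shell
# Hypothetical file: line 2 is empty, line 3 holds only spaces and a tab.
tmpdir=$(mktemp -d)
file="$tmpdir/lines.txt"
printf 'first\n\n \t \nlast\n' > "$file"

# A literal tab character, kept in a variable so it is visible in the pattern.
tab=$(printf '\t')

# Drop blank lines: nothing but spaces and tabs between ^ and $.
no_blank=$(grep -v "^[ $tab]*$" "$file")

# Drop only empty lines: the whitespace-only line survives, so count is 3.
non_empty_count=$(grep -vc "^$" "$file")

# Without the closing $, the pattern matches every line, so -v prints nothing.
unanchored=$(grep -v "^[ $tab]*" "$file")

printf 'no blank lines:\n%s\n\nnon-empty line count: %s\n' "$no_blank" "$non_empty_count"

rm -rf "$tmpdir"
```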

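Grep has some cousins, egrep and fgrep. (Distant relations of the Addams family, no doubt.) Egrep understands extended regular expressions, and its notable feature is the use of the pipe as a logical OR operator. The pipe separates the choices, and parentheses may be placed around them to group the choices within a larger pattern. Fgrep searches for fixed strings and treats every character in the pattern literally.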
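The sketch below shows the OR operator and a fixed-string search. It uses grep -E and grep -F, which POSIX defines as the equivalents of egrep and fgrep; some newer systems print a warning when the older names are used. The two files and their contents are made up for the example.

```shell
# Hypothetical sample files.
tmpdir=$(mktemp -d)
schools="$tmpdir/schools.txt"
strings="$tmpdir/strings.txt"
printf '%s\n' 'Baker School' 'Miller Academy' 'Smith College' > "$schools"
printf '%s\n' 'a.b' 'axb' > "$strings"

# egrep style: the pipe means OR, the parentheses group the choices.
either=$(grep -E "(Baker|Smith)" "$schools")

# fgrep style: the period is just a period.
literal=$(grep -F "a.b" "$strings")

# Plain grep: the period is a wild card, so "axb" matches too.
regex=$(grep "a.b" "$strings")

printf 'either:\n%s\n\nliteral:\n%s\n\nregex:\n%s\n' "$either" "$literal" "$regex"

rm -rf "$tmpdir"
```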

Sometimes the pattern you are searching for may occur inside other undesired patterns, such as inside other words. When this is so, you can run the more general search first and then pipe the output through a search that excludes what is undesired. This is not as good as it seems, however. In the example

	grep on filename | grep -v ion

we search the file for the string "on", then send its output to the next grep command as input, ignoring all lines having the string "ion" in them. Cool, but what about lines that have both words in them? Those lines are thrown out too, even though they contain the "on" we wanted. Sometimes there is no substitute for real proofreading.
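The weakness of the two-step search shows up clearly on a small made-up file. The sketch also tries the -w (whole word) option as one possible alternative. That option is an addition of this example: it is available in GNU and BSD grep but is not required by POSIX, so treat it as an assumption about your system.

```shell
# Hypothetical sample file.
tmpdir=$(mktemp -d)
file="$tmpdir/text.txt"
printf '%s\n' \
    'turn it on' \
    'an opinion' \
    'action on the motion' \
    'all quiet' > "$file"

# Two-step search: find "on", then throw away any line containing "ion".
piped=$(grep on "$file" | grep -v ion)

# Whole-word search (assumes a grep that supports -w).
whole_word=$(grep -w on "$file")

printf 'piped:\n%s\n\nwhole word:\n%s\n' "$piped" "$whole_word"

rm -rf "$tmpdir"
```

The piped search loses "action on the motion", which holds the word "on" along with two words ending in "ion". The whole-word search keeps it.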