This chapter discusses the use of regular expressions with the grep command. Regular expressions are patterns that you use to search text files. Objectives important to this chapter are:
Concepts: Grep stands for global regular expression print, a name taken from the ed editor command g/re/p. (Aren't you glad you know that now?) The reason it exists is to search for regular expressions, or patterns, in text files, and to print, or report, the lines where it finds them. Grep uses pattern matching operators, some of which we have already seen used in other ways. This means that the way grep uses some operators is not the way other commands use them. Some pattern matching operators are:
The caret is used in a text search string (regular expression, right?) to show the start of a line. The command grep "public" would search for lines of text that had the string "public" anywhere in them. However, the command grep "^public" searches for lines that begin with "public".

The dollar sign is used by grep to mark the end of a line in regular expressions. The command grep "public$" would find lines of text that end with "public". Combining the two symbols, the command grep "^public$" would search for lines on which "public" is the only text.

The basic syntax of the grep command is in the form grep pattern filename. This syntax may be varied either by supplying a list of filenames, or by using expansion metacharacters like the asterisk to specify all files in a group (such as grep "stuff" *.txt).

Your author groups the next two operators together as "parentheses" operators. They are more properly referred to as the curly brace and the square bracket. (In case you are wondering, neither is a parenthesis.)

Square brackets can be used, as we have seen before, to enclose a set of acceptable values. For instance, the command grep "[Bb]aker" schools.txt would find all lines in the file schools.txt that contain "Baker" or "baker". Ranges of values, such as [0-9], may also be specified inside the brackets.

Curly braces are used to specify the number of times a pattern must be repeated for the line to be printed. This is known as a closure. (Not all versions of UNIX support this operator.) When used with grep, each curly brace should be preceded by a backslash. The command grep "[0-9]\{5\}" address.txt will search the address.txt file for all lines containing five digits in a row (like zip codes, for instance). Lines with longer numbers match as well, because a longer number still contains five consecutive digits.
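The anchors, bracket sets, and brace counts described above can be tried on a small file. The following is a minimal sketch, not part of the original notes: the sample lines and the variable names are made up for illustration, and single quotes are used so that the shell leaves the dollar sign and backslashes alone.

```shell
# Hypothetical sample data, written to a temporary file for illustration.
tmp=$(mktemp)
printf '%s\n' \
  'public' \
  'public class Demo' \
  'republic of letters' \
  'made public' \
  'Baker School 02134' \
  'baker street 123' > "$tmp"

# grep -c prints the number of matching lines instead of the lines themselves.
starts=$(grep -c '^public' "$tmp")    # lines beginning with "public"
ends=$(grep -c 'public$' "$tmp")      # lines ending with "public"
only=$(grep -c '^public$' "$tmp")     # lines consisting solely of "public"
bakers=$(grep -c '[Bb]aker' "$tmp")   # "Baker" or "baker" anywhere on the line
five=$(grep '[0-9]\{5\}' "$tmp")      # lines with five digits in a row

echo "start: $starts  end: $ends  only: $only  bakers: $bakers"
echo "five digits: $five"
rm -f "$tmp"
```

Note that "republic of letters" contains the string "public" but is matched by none of the anchored patterns, since the string is neither at the start nor at the end of that line.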
If we want lines containing runs of five to seven digits, we could use the command grep "[0-9]\{5,7\}" address.txt where five is the least number of digits (repetitions) and seven is the greatest number of digits that will match the pattern. As with the single count, a line holding a longer number still matches unless the pattern is anchored, because the longer number contains a run of five to seven digits.

Other closure expressions are the plus, the asterisk and the question mark. The plus sign after a set of values means match on at least one appearance of one of the values. The asterisk after a set means match on zero or more appearances of the pattern specified. The question mark means match on zero or one instance of the search pattern. In basic grep patterns the plus and the question mark are ordinary characters; they act as operators in extended patterns, such as those used by egrep or grep -E.

The grep command has several useful options, each preceded by a hyphen.
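The repetition counts and the other closure operators can be compared side by side. This is a rough sketch under stated assumptions: the sample lines and variable names are invented for illustration, and the plus and question mark are run through grep -E because they are extended-pattern operators.

```shell
# Hypothetical sample data for illustration.
tmp=$(mktemp)
printf '%s\n' \
  'zip 1234' \
  'zip 12345' \
  'zip 1234567' \
  'zip 123456789' \
  'color' \
  'colour' \
  'colouur' > "$tmp"

# A run of five to seven digits anywhere on the line. The nine-digit line
# matches too, because it contains such a run.
range=$(grep -c '[0-9]\{5,7\}' "$tmp")

# Anchoring the run (a space before it, end of line after it) keeps only
# numbers that are five to seven digits long.
exact=$(grep -c ' [0-9]\{5,7\}$' "$tmp")

# Asterisk: zero or more of the preceding item (basic patterns).
star=$(grep -c 'colou*r' "$tmp")

# Plus and question mark: one or more, and zero or one (extended patterns).
plus=$(grep -E -c 'colou+r' "$tmp")
optional=$(grep -E -c 'colou?r' "$tmp")

echo "range: $range  exact: $exact  star: $star  plus: $plus  optional: $optional"
rm -f "$tmp"
```

The difference between the first two counts is the reason to anchor a pattern when the length of the number matters.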
As a variation on the -v option, you can tell grep not to print any blank lines in a file. The command grep -v "^[ ]*$" filename will do this when the brackets enclose a space and a tab: the star says match on zero or more of them, and the dollar sign requires that nothing else follow before the end of the line. This treats any line made up only of spaces and tabs as a blank line. Without the dollar sign the pattern would match every line, since every line starts with zero or more spaces, and the -v option would then print nothing.

Empty lines are lines that contain only a line return character. To ignore them, we can use the command grep -v "^$" filename.

Grep has some cousins, egrep and fgrep. (Distant relations of the Addams family, no doubt.) The notable feature of egrep is its use of the pipe as a logical OR operator. The pipe separates the choices, and parentheses can be used to group the choices inside a larger pattern. The fgrep command searches for fixed strings rather than patterns.

Sometimes the pattern you are searching for may occur inside other undesired patterns, such as inside other words. When this is so, you can run the more general search first and then pipe the output through a search that excludes what is undesired. This is not as good as it seems, however. In the example grep on filename | grep -v ion we search the file for the string "on", then send the output to the next grep command as input, which drops all lines having the string "ion" in them. Cool, but what about lines that have both in them? A line that contains the word "on" and also a word like "nation" is dropped along with the rest. Sometimes there is no substitute for real proofreading.
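The blank-line filters, the OR operator, and the two-stage search can be checked on a small file. This is a sketch only: the sample text and variable names are hypothetical, grep -E is used as the equivalent of egrep, and the tab is produced with printf so that it cannot be lost in typing.

```shell
# Hypothetical sample data: line 2 is empty, line 3 holds a space and a tab.
tmp=$(mktemp)
printf 'first line\n\n \t\nnation on hold\nturn on\n' > "$tmp"

tab=$(printf '\t')   # a literal tab character for use inside the brackets

# Count lines that are not empty.
nonempty=$(grep -v -c '^$' "$tmp")

# Count lines that hold something other than spaces and tabs.
nonblank=$(grep -v -c "^[ $tab]*\$" "$tmp")

# OR operator in an extended pattern: lines containing "first" or "turn".
either=$(grep -E -c 'first|turn' "$tmp")

# Two-stage filter: keep lines with "on", then drop lines with "ion".
filtered=$(grep 'on' "$tmp" | grep -v 'ion')

echo "nonempty: $nonempty  nonblank: $nonblank  either: $either"
echo "filtered: $filtered"
rm -f "$tmp"
```

The line "nation on hold" contains the separate word "on", yet the second grep discards it because of "nation". That is the weakness of the two-stage search described above.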