Chapter 4, UNIX/Linux File Processing

ITS 2310 - Linux I

Objectives:

This lesson discusses operations you can perform on files from a command line in UNIX and Linux. Objectives important to this lesson:

File processing
Basic file operations: create, delete, copy, move
More file operations: combine, cut, paste, rearrange, sort
Creating a script
Using the awk command

Concepts:

This chapter offers some practical commands that can be used with the file system knowledge you gained from previous chapters. As noted in the objectives on page 157 (and those listed above), you should be able to manipulate files from a Linux command line by the end of this lesson.

In the last chapter, the text wanted to cluster all files into two types: ASCII files and binary files. The concept is expanded a bit in chapter 4. We are told that text files and binary files are both classified by the file system as regular files, which is what the initial dash in their permissions lists means when viewing the output of an ls -l command. The text informs us that some of the files it mentioned in the last chapter, device special files, are further categorized as character special files and block special files, each of which will have different type tags.

Character special files handle data one character at a time, and their type tag is c. They typically represent devices such as terminals and displays.
Block special files are used to process blocks of data at a time, and their type tag is b. This type of file is meant for storage devices.
As you already know, directories have a type tag of d, and regular files have a hyphen as a type tag.
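To see the type tags for yourself, here is a minimal sketch. It assumes a Linux system where /dev/null exists as a character special file; workdir and demo.txt are hypothetical names made up for the demonstration. Block special files (type tag b) usually appear in /dev for disks, but their names vary from system to system, so none is listed here.

```shell
# Make a scratch directory and one regular file (hypothetical names)
workdir=$(mktemp -d)
touch "$workdir/demo.txt"

# The first character of each line in a long listing is the type tag
ls -ld /dev/null "$workdir" "$workdir/demo.txt"

# Pull out only the type tag (the first character) for each entry
null_tag=$(ls -ld /dev/null | cut -c1)            # c on Linux
dir_tag=$(ls -ld "$workdir" | cut -c1)            # d
file_tag=$(ls -ld "$workdir/demo.txt" | cut -c1)  # -
echo "$null_tag $dir_tag $file_tag"
```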

The text goes on for a couple of pages describing ASCII files, database file structures, and other concepts that are not useful to the objectives of the chapter. Let's move on to the Processing Files section that begins on page 160.

In this section, the text introduces a concept that relates to several programming languages and to operating system environments. Last week we talked a bit about data streams. Streams are communication channels between a program (or an OS) and its hardware/software environment. They exist for multiple programming languages and for the operating systems we are talking about. We generally care about three streams:

stdin - the standard input stream, typically fed by a user's keyboard and mouse; its stream number is 0
stdout - the standard output stream, typically feeding to a monitor and/or speakers; its stream number is 1
stderr - the stream used to notify users about errors, which is typically done on the same hardware used by the stdout stream; its stream number is 2

C, UNIX, and Linux treat all these streams like files, just like they do with everything else. They send data that is meant for a stream into it the same way they send data to any other file. The operating system you are running defines what the streams connect to.

The text explains that we can redirect the output of a process to a file (including the three streams) by using greater than symbols. Most processes take their input from a default location, such as stdin, but we can tell a process to take input from a file using less than symbols.

sending the output of a process to a file: command_name > filename
The example above would take the output of a command and write it to a file, overwriting that file if it already existed.
appending the output of a process to a file: command_name >> filename
The use of the double greater than signs in this example would take the output of a command and write it to the end of a file, creating that file if it does not already exist.
Collecting errors in a file: command_name 2>> filename
The example above would take any error messages generated by the command and append them to a file, creating that file if it does not already exist. Note the use of the numeral 2, meaning file stream number 2, to collect only error messages. Had the operator been 1>>, the command would have captured standard output, but not errors, in the log file instead.
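The three redirection forms above can be tried safely in a scratch directory. This is only a sketch: the file names (out.txt, overwrite.txt, errors.log, no_such_file) are hypothetical, and ls is used simply because it reliably complains on stderr when a file is missing.

```shell
# Work in a throwaway directory so nothing important is overwritten
workdir=$(mktemp -d)
cd "$workdir"

echo "first line" > out.txt       # > creates out.txt (or overwrites it)
echo "second line" >> out.txt     # >> appends to the end of out.txt

echo "old contents" > overwrite.txt
echo "new contents" > overwrite.txt   # the old contents are gone

# ls writes its complaint to stderr (stream 2); 2>> appends it to errors.log.
# "|| true" only keeps a strict shell (set -e) from stopping on the failure.
ls no_such_file 2>> errors.log || true

cat out.txt
cat overwrite.txt
cat errors.log
```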

The text describes using the less than sign to use a file as an input source to a command. This is more useful in testing commands and scripts than in real life.
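A short sketch of input redirection, assuming a hypothetical file named fruit.txt. Both wc and sort normally read stdin when no file name is given, so the less than sign feeds them the file instead of the keyboard.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build a small three-line test file (hypothetical name and contents)
printf 'pear\napple\nfig\n' > fruit.txt

# wc reads the file through stdin, so it reports a count with no file name
line_count=$(wc -l < fruit.txt)
echo "$line_count"

# sort reads the same file through stdin and prints the lines in order
sorted=$(sort < fruit.txt)
echo "$sorted"
```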

The video below runs through several commands that are discussed in this chapter and others you may find useful:

sudo
touch
echo
cat
mkdir
mv
ls
wget
find
stat
chmod
rm

Creating a file is most properly done by using a process or application meant to do so. However, if you really want to create a new, empty file, the text offers two ways to do so on page 162:

> filename
This command has no command name, which makes it look very wrong. In this case, the redirection operator is pointing to the name of a file. We are not calling any process to send output to the file, which should make you uneasy, but the technique works. A file is created with the name you supply, but it is created with no contents, because no input or output stream has been specified.
touch filename
The touch command can be used to update the last modified date of a file, or it can be used, as in this example, to create a new empty file. I have tried to use this command for this purpose, and it is not always successful. If you really want a file, use a method that puts something in it.
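Both methods can be checked with a quick sketch like the one below. It assumes a Bourne-style shell such as bash or sh; some other shells (zsh, for example) may treat a bare redirection differently. The file names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"

> empty1.txt          # bare redirection: creates the file with no contents
touch empty2.txt      # touch: creates the file if it does not exist yet

# Both files should show a size of 0 bytes in the long listing
ls -l empty1.txt empty2.txt
```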

Deleting a file was mentioned in the last chapter. A summary of the rm (remove) command appears on page 163. The simple notation is just rm filename.

You already know that the rmdir command can be used to delete an empty directory. The text explains that the rm command may also be used to delete a directory, but only if it is issued as rm -r directoryname. The -r switch stands for recursive, which means the command is applied to the contents of the directory as well. Without -r, UNIX and Linux typically refuse to delete a directory through the rm command, empty or not. If you use rm -r, you are telling the system to delete all the files (and subdirectories, and their files, if they exist) inside the directory first, and then the directory itself. This makes rm -r a very powerful command. Use it with caution.
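Here is a cautious way to watch the difference between rmdir, rm, and rm -r. Everything happens inside a scratch directory, and the names (project, sub, a.txt, b.txt) are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build a small directory tree to delete
mkdir -p project/sub
touch project/a.txt project/sub/b.txt

# rmdir refuses because the directory is not empty
rmdir project 2>/dev/null || echo "rmdir refused: directory not empty"

# rm without -r refuses because project is a directory
rm project 2>/dev/null || echo "rm refused: project is a directory"

# rm -r removes the files, the subdirectory, and then project itself
rm -r project
```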

The text reviews the cp (copy) command on pages 164 and 165. It reminds us that directories are not allowed to contain two (or more) files with the same name. A copy sent to the same folder must have a new name. Several variations on cp are mentioned:

cp filename newfilename
This method allows you to have a backup copy of a file in the same directory as the original file.
cp filename path-to-a-folder
This method copies the original file to a new location. It omits a filename for the destination copy, which would result in using the same filename as the original. You could specify a different filename for the copy if you like.
cp directoryname/* otherdirectoryname
This method copies all the files in the source directory to the destination directory. Each of the directory specifiers should include path information if needed. When is it needed? When the directories are not children of the current directory.
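The three cp variations can be sketched as follows. The directory and file names (source, backup, notes.txt, data.txt) are hypothetical and are created in a scratch directory first.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir source backup
echo "report" > source/notes.txt
echo "data" > source/data.txt

# 1. Copy within the same directory, so the copy needs a new name
cp source/notes.txt source/notes.bak

# 2. Copy to another directory; the copy keeps the original file name
cp source/notes.txt backup

# 3. Copy every file in one directory to another directory
cp source/* backup

ls backup
```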

On page 166, the text discusses the mv (move) command. Think of a move as being like a cut and paste operation, taking files from one folder and placing them in another. The syntax is like the cp command described above.
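A minimal sketch of mv, using hypothetical names (inbox, archive, memo.txt). Note that moving a file to a new name in the same directory is how files are renamed from the command line.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir inbox archive
echo "draft" > inbox/memo.txt

# Move the file to another directory; it no longer exists in inbox
mv inbox/memo.txt archive

# Move it to a new name in the same directory, which renames it
mv archive/memo.txt archive/memo-final.txt

ls inbox archive
```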

We have used the ls command to show us information about files whose location we know. The find command is used to locate files. The text explains the syntax like this:
find [pathname] [-name filename]
We are told that the find command will search recursively by default. This means that if you give it a directory to search, it will also search all subdirectories under the specified directory.
The -name switch enables you to specify a search string for the filename, including wildcards if needed. The -iname switch allows a search string, and it ignores case, making it a much more flexible search. Other options are described, which may be useful in specific searches.
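To compare -name and -iname, try a sketch like this one. The directory tree is hypothetical and built just for the test. The -iname switch is available in the find found on Linux systems, but it may be missing from older UNIX versions.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p docs/old
touch docs/Report.txt docs/old/report.txt docs/notes.md

# -name is case sensitive: only docs/old/report.txt matches
find docs -name "report.txt"

# -iname ignores case: both Report.txt and report.txt match
find docs -iname "report.txt"

# Quote wildcards so find sees them, not the shell
find docs -name "*.md"

exact_count=$(find docs -name "report.txt" | wc -l)
anycase_count=$(find docs -iname "report.txt" | wc -l)
```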

The text describes another use for the cat command on pages 167 and 168. It proposes a situation in which you want to combine two text files into one. You could do this with the >> redirection operator, as you should remember, but the text suggests another way:
cat file1 file2 > file3
This syntax would send the contents of file1, followed by the contents of file2, to standard output, which the > operator redirects into file3. Neither original file is changed.
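A quick sketch of the technique, with two hypothetical one-line files created first so there is something to combine:

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'alpha\n' > file1
printf 'beta\n' > file2

# Combine the two files, in the order named, into a third file
cat file1 file2 > file3

cat file3
```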

The paste command is described as a utility that reads two files, line by line. It combines the first lines from each file, the second lines from each file, and so on, putting the resulting longer lines into a new file. As you can see from the example in the text, this would be best done with two files having the same number of lines, each line having information that is meant to be viewed in a column. This is one way to create a file that resembles a table in a database.
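As a small illustration (not the example from the text), the sketch below pastes a hypothetical list of names next to a hypothetical list of phone numbers. By default, paste separates the joined pieces with a tab character.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'Ann\nBob\n' > names.txt
printf '555-0101\n555-0102\n' > phones.txt

# Line 1 of names.txt is joined to line 1 of phones.txt, and so on
paste names.txt phones.txt > directory.txt

cat directory.txt
```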

The cut command is explained as working best on tables of data. It can be thought of as a way to undo the creation of the table that you made in the previous example. As with the command above, it is a little hard to imagine wanting to do this kind of thing. You may want to think of it as the ancestor of similar functions found in actual database management programs. This is a way to do database reporting if you have data files but don't have the database software.
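Here is a hedged sketch of cut pulling columns back out of a table. The file directory.txt and its contents are hypothetical; its columns are separated by tabs, which is the delimiter cut expects unless you tell it otherwise with -d.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'Ann\t555-0101\tSales\nBob\t555-0102\tIT\n' > directory.txt

# -f1 selects the first field (column) of every line
cut -f1 directory.txt

# -f1,3 selects the first and third fields, skipping the phone numbers
cut -f1,3 directory.txt > names_depts.txt
cat names_depts.txt

# -d sets a different delimiter, here a colon
echo "root:x:0" | cut -d: -f1
```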

The sort command is used, as you might imagine, to sort the contents of a file in some meaningful way: alphabetic, numeric, or the reverse of either of those orders. Its options provide more choices about what data elements to sort by, assuming that we are dealing with a table of data.
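The sketch below shows the three kinds of sorting on a hypothetical two-column file named stock.txt. The -k2,2n option asks sort to use only the second field as the key and to compare it as a number.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'pear 10\napple 9\nfig 100\n' > stock.txt

# Alphabetic order by whole line: apple, fig, pear
sort stock.txt

# Reverse alphabetic order: pear, fig, apple
sort -r stock.txt

# Numeric order by the second column: 9, 10, 100
sort -k2,2n stock.txt
```

Without the n, the second column would be compared as text, and 100 would sort ahead of 9.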

The fourth objective for this lesson is about scripts, which I have mentioned to you several times. Most of our recent work has been about operations from the command line, using commands found in most shells. A shell script is a file in which you have saved one or more commands that you would like to run regularly, frequently, or reliably. If you find there is a sequence of commands that are useful to you, you may want to create a shell script holding those commands. If you use commands that are intricate and hard to remember, you may want a script for those commands so you can run them the same way each time, without having to worry about syntax errors. A script file would be useful enough if you could only use it for reference. It is more useful than that, however. Once a script is finished, the commands stored in it can be run in sequence simply by entering the name of the script as a command.

We will hit some material about this in chapter 6. The text promises some practice in project 4-15. However, that project requires that you do several other projects first to create the data files it uses, which is a little much for this chapter. The link above will take you to a series of short lessons about creating, granting rights to, and using shell scripts. Take a look at them if you have time.
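As a minimal sketch of the idea (not project 4-15 from the text), the commands below create a tiny script, grant execute permission on it, and run it. The script name listfiles.sh and its contents are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Write the script file; everything up to EOF becomes its contents
cat > listfiles.sh <<'EOF'
#!/bin/sh
# listfiles.sh: report the current directory, then list its files
echo "Files in $(pwd):"
ls
EOF

# Grant execute permission so the script can be run by name
chmod +x listfiles.sh

# Run it; ./ tells the shell to look in the current directory
./listfiles.sh
```

The first line of the script (#!/bin/sh) tells the system which shell should run the commands that follow.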

Let's skip ahead to the introduction to awk on page 176 to finish this chapter. Awk is more than a command. It is a programming language that can be used in Linux inside script files. The text provides a two-page introduction to several features, and mentions that your version of Linux may come with gawk, an improved version of awk. For this lesson, I have placed two handouts/downloads about awk in the Week 4 folder on Canvas.
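To give you a taste before you open the handouts, here is a small sketch that should behave the same in awk and gawk. The file staff.txt and its three columns (name, department, salary) are hypothetical. Awk splits each line into fields named $1, $2, and so on, and runs the commands inside the braces once for each line.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'Ann Sales 52000\nBob IT 61000\nCy Sales 48000\n' > staff.txt

# Print only the first field of every line
awk '{ print $1 }' staff.txt

# Print name and salary, but only for lines whose second field is Sales
awk '$2 == "Sales" { print $1, $3 }' staff.txt

# Add up the third field; the END block runs after the last line is read
total=$(awk '{ sum += $3 } END { print sum }' staff.txt)
echo "$total"
```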

As an alternative, you can play the video below to learn to use some awk commands from Gary Explains.