ITS 4350 - Disaster Recovery


Chapter 3, Contingency Strategies for IR/DR/BC

Objectives:

This lesson is about chapter 3. The chapter is divided into two parts, each with its own objective. Objectives important to this lesson:

  1. Data and application resumption
  2. Site resumption

Concepts:

Chapter 3

The scenario that opens chapter 3 tells a convincing story. A mail room employee spills coffee, mops up the mess, and breaks open an envelope filled with white powder. At this moment, we readers should pause to express our gratitude that someone at this company anticipated this event, planned actions to take, trained staff, and made everyone familiar enough with the plan that its execution began correctly and continued without a reported glitch.

The title of the chapter is explained briefly with a review of earlier material. Incident Response (IR) is what we do when something happens. Business Continuity (BC) is what we do to keep the business operating while we can't operate the way we normally do. Disaster Recovery (DR) is what we do to return to normal operations.

The first half of the chapter is about data and application resumption, methods that help us when we lose or lose access to data and applications. On page 93, the primary topic is backup and retention strategy: what kind of backup do we make, where do we keep it, and how long will we keep it? The text suggests you should find out if there are legal requirements you must follow before you set a policy that winds up being useless.

In the example table on page 93, there are three levels of priority for data: low, moderate, and high. Each higher level represents data that is more important to the organization, whose loss would be more damaging, and whose replacement is needed sooner than data on lower levels.

Priority | Backup/Recovery
Low. We might need it eventually. | Tape backup, and put it where we can find it when we have time, at a cold site.
Moderate. We need it, but we can bring it back in a little while. | Optical disc, copied over a WAN, stored in a warm site if the boss cares, a cold one if not.
High. We need this right away. | Live copy (mirror) system, advanced RAID, kept at a hot site.

In case you are not aware, a cold site may just be office space with a potential for computers and data. A warm site has computers that need to be brought up and loaded from recent backups. A hot site has computers that are running, with our data already available on them.
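
The example table can also be written down as a small lookup, which is roughly how a backup policy might be recorded in a script. This is only a sketch: the names `BackupPolicy`, `POLICY_BY_PRIORITY`, and `policy_for` are made up for these notes, and the Moderate row is simplified to the warm site option.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPolicy:
    medium: str  # what the backup is written to
    site: str    # where the backup is kept: cold, warm, or hot


# Values paraphrase the example table in these notes; they are not a recommendation.
# Assumption: Moderate data goes to a warm site (the table also allows a cold one).
POLICY_BY_PRIORITY = {
    "low": BackupPolicy(medium="tape", site="cold"),
    "moderate": BackupPolicy(medium="optical disc copied over a WAN", site="warm"),
    "high": BackupPolicy(medium="live mirror with RAID", site="hot"),
}


def policy_for(priority: str) -> BackupPolicy:
    """Look up the backup medium and recovery site for a data priority level."""
    return POLICY_BY_PRIORITY[priority.strip().lower()]
```

Calling `policy_for("High")` returns the hot site entry. A priority that is not in the table raises a `KeyError`, which is a reasonable signal that the policy has a gap.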

The text considers online backups and cloud services on page 94. It points out that free services usually have no guarantees that they will be available, and they often have file space limits. Cloud storage is simply storing data on someone else's computer, accessed across the Internet or through a dedicated data line. Cloud computing is different: it means that you are running an application on someone else's computer. For example, you may be running virtual machines on computers owned by Amazon. That could be part of a business continuity plan, or it could be a regular part of your business operation.

Regarding cloud storage, the text lists three categories that may be helpful:

  • public cloud - service is available to the public over Internet connections
  • community cloud - a shared solution using common equipment that is not accessible by the public; example: a shared cloud for local and county government offices, funded by each of them for their common use
  • private cloud - a solution that is run by and for a particular organization, but accessible without members having to be on the corporate network to use it

The text discusses modified backup strategies, and most seem to be layered concepts, such as backups going to a local cluster of drives before being sent to tape or cloud storage. This leads to the discussion of some classic backup strategies. Each strategy is the same regardless of the medium being used. It doesn't matter whether we are using tape, disc, hard drive, or external storage. First some terms, and the names of the strategies:

  • Target - the device, volume, folder, or group of files being backed up; the source of the material in a backup operation
  • Archive bit - a one-bit file attribute, stored by the file system with the file's other attributes rather than in the file's contents, that is turned ON when the file is changed; it is used to flag files that have changed since the last backup; most backup programs look for files whose archive bits are set to ON, copy those files, then reset the archive bits (turn them OFF) on the target files
  • Full - a backup of all files in the target; sets the archive bit of each file to OFF once the backup is made; your text assumes you know what a full backup is
  • Incremental - a backup of target files that are new or changed since the last backup; depends on the fact that programs that change files typically set the archive bit to ON when a change is made; sets archive bit to OFF for all files it copies
  • Differential - a backup of all files new or changed since the last Full backup; copies all files whose archive bit is set to ON; does not change the archive bit of files it copies because they will be copied again in the next differential backup
  • Copy - like a Full backup, but it does not change the archive bits of files it copies. This is typically not part of a standard backup strategy, but an option to work around the system.
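
The definitions above can be tried out with a small simulation. This is a sketch, not how any particular backup program works: the target is modeled as a dictionary mapping a file name to its archive bit, and the function names `run_backup` and `modify` are invented for these notes.

```python
def run_backup(files, kind):
    """Return the sorted names a backup of the given kind would copy.

    `files` maps a file name to its archive bit (True means ON, i.e. changed).
    The bits are updated in place the way each backup type would update them.
    """
    if kind not in {"full", "incremental", "differential", "copy"}:
        raise ValueError(f"unknown backup kind: {kind}")

    if kind in ("full", "copy"):
        # Full and Copy take everything in the target, changed or not.
        copied = sorted(files)
    else:
        # Incremental and Differential take only files whose archive bit is ON.
        copied = sorted(name for name, bit in files.items() if bit)

    if kind in ("full", "incremental"):
        # Only Full and Incremental turn the archive bit OFF on what they copy.
        for name in copied:
            files[name] = False

    return copied


def modify(files, name):
    """Simulate a program changing (or creating) a file: the archive bit turns ON."""
    files[name] = True
```

Start with three files, run a Full backup, then change one file per day. A Differential backup keeps picking up every file changed since the Full backup, because it leaves the bits ON. An Incremental backup picks up the changed files once and then finds nothing until another file changes.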

Assume we use a "tape drive" to make backups. In a Full backup strategy, the entire target is backed up to tape every time we make a backup tape. This strategy consumes the most time and the most tapes to carry out a backup. To restore, we simply restore the most recent tape(s). This is the least time consuming strategy for restoring, but the most time consuming for creating backups.

The second method, Incremental backup, means that we start with a Full backup of the target, and then each successive backup tape we create only backs up the elements that are new or changed since the last backup was created. This means that successive backups will not always be the same length. Because each tape holds only the most recent changes, this is the least time consuming backup, but the most time consuming restore. To restore, we must first restore the last Full backup made, and then restore EVERY tape made since then, in order, to ensure getting all changes.

The third strategy, Differential backup, also starts with a Full backup tape. Then each successive tape made will contain all the files changed since the last Full backup was made. This means that we will have to restore only one or two tapes in a restore operation. If the last tape made was a Full tape, we restore only that one. If the last tape made was a Differential tape, we restore the last Full tape, then the last Differential tape.
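
The restore rules for the three strategies can be sketched as a function that picks which tapes to load. The function name `tapes_to_restore` is made up for these notes, and it assumes a cycle uses only one kind of backup (Incremental or Differential) after each Full backup.

```python
def tapes_to_restore(history):
    """Return the positions of the tapes needed to restore the latest state.

    `history` lists the backup kinds in the order the tapes were made,
    for example ["full", "incremental", "incremental"]. Positions start at 0.
    """
    if "full" not in history:
        raise ValueError("a restore needs at least one Full backup")

    # Every restore starts from the most recent Full tape.
    last_full = len(history) - 1 - history[::-1].index("full")
    later = list(range(last_full + 1, len(history)))
    kinds = {history[i] for i in later}

    if not kinds:
        return [last_full]                      # the Full tape is the newest
    if kinds == {"incremental"}:
        return [last_full] + later              # the Full, then EVERY later tape
    if kinds == {"differential"}:
        return [last_full, later[-1]]           # the Full, then the last tape only
    raise ValueError("expected only incremental or only differential tapes after the last Full")
```

With one Full tape followed by three daily tapes, the Incremental strategy needs all four tapes (positions 0, 1, 2, 3), while the Differential strategy needs only two (positions 0 and 3).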

The fourth strategy, Copy, is no different from Full in terms of backup or restore time, assuming it is a full copy.

In both Incremental and Differential backup strategies, you will typically use a rotation schedule. For example, you could have a one week cycle. Once a week, you make a Full backup, then every day after that you make the other kind you have chosen to use: Incremental or Differential.

To keep them straight in your mind, remember these facts:

Backup type | What does it back up? | What does it do to the archive bit?
Full | copies everything | Resets all archive bits in the target set.
Incremental | everything different from the last backup | Resets the archive bits of the target files it copies.
Differential | copies everything "different from Full" (different from the last Full backup) | Does not reset any archive bits.
Copy | makes a Full or selected items backup | Does not reset any archive bits.

The time required to create backups should be considered along with the time to restore a backup. When you consider the two concepts as two sides of the answer to a question (What method should I use?), the answer may be the most common choice: Differential. It is the best compromise in terms of backup time versus restore time. Note also that all standard methods require a Full backup on a regular cycle. The recommendation is usually to run a Full backup weekly.
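
To see why Differential is called a compromise, it helps to put numbers on a one week cycle. The sketch below is a rough model, not a measurement: the function `weekly_cost` is made up for these notes, the sizes are hypothetical, and it assumes one backup per tape and that each day changes a different set of files of the same total size.

```python
def weekly_cost(strategy, full_size_gb, daily_change_gb, days_after_full=6):
    """Return (total GB written in the cycle, tapes read for a restore on the last day)."""
    if strategy == "full":
        # A complete copy every day, so a restore needs only the newest tape.
        written = full_size_gb * (1 + days_after_full)
        tapes = 1
    elif strategy == "incremental":
        # One Full, then only that day's changes; a restore needs every tape.
        written = full_size_gb + daily_change_gb * days_after_full
        tapes = 1 + days_after_full
    elif strategy == "differential":
        # One Full, then day n holds all n days of changes since the Full.
        written = full_size_gb + daily_change_gb * sum(range(1, days_after_full + 1))
        tapes = 2 if days_after_full else 1
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return written, tapes
```

For a hypothetical 100 GB target with 5 GB of changes per day, the Full strategy writes 700 GB and restores from 1 tape, Incremental writes 130 GB and restores from 7 tapes, and Differential writes 205 GB and restores from 2 tapes. Differential sits between the other two on backup volume while staying close to Full on restore effort.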

The text discusses fault tolerance, by which it means the ability of a system to tolerate the failure of a part. In particular, it is concerned about the failure of a hard drive that holds important data. Eventually, all hard drives fail. Systems that tolerate this kind of event typically use a form of RAID, which allows a system to continue running in most cases. The acronym has been defined several ways. One common meaning is Redundant Array of Independent Drives. The word "independent" seems unnecessary, and is in fact misleading, because hard drives set up in a RAID array perform functions that relate to each other.

Several kinds of RAID exist to provide for redundant storage of data or to provide for a means to recover lost data. The text lists several types and discusses a few. Note that RAID 0 does not provide fault tolerance, the ability to survive a device failure. It only improves read and write times.

RAID levels and features:

  • RAID 0: Disk striping - writes to multiple disks, does not provide fault tolerance. Performance is increased, because each successive block of data in a stream is written to the next device in the array. Because data is spread across every device, the failure of one device affects all data, so RAID 0 in fact decreases fault tolerance.
  • RAID 1: Mirroring and Duplexing - provides fault tolerance by writing the same data to two drives. Two mirrored drives use the same controller card. Two duplexed drives each have their own controller card. Aside from that difference, mirroring and duplexing are the same: Two drives are set up so that each is a copy of the other. If one fails, the other is available.
  • RAID 2: Bit-level disk striping with Hamming code error correction. Not widely used. Neither are RAID 3 and RAID 4, which stripe data and keep parity on one dedicated drive.
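  • RAID 5: Parity saved separately from data - Provides fault tolerance by a different method. Data is striped across several drives, but parity data for each stripe is saved on a drive that does not hold data for that stripe, so the contents of any one failed drive can be recalculated from the others. Older Windows workstation editions did not offer this method in software; it was supported only by the server operating systems. Hardware RAID controllers, and software RAID on other systems such as Linux, make it available on workstations as well.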
  • RAID 0+1: Striping and Mirroring - uses a striped array like RAID 0, but mirrors the striped array onto another array, kind of like RAID 1
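
The parity idea behind RAID 5 can be shown with XOR, the operation parity RAID uses to recompute a lost block. This is a minimal sketch of a single stripe with hypothetical function names. A real array works on disk blocks and rotates the parity position from stripe to stripe, which this example does not do.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block for the stripe."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Recompute the block at lost_index from the surviving blocks in the stripe."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)
```

If a stripe is built from the blocks `b"ABCD"`, `b"EFGH"`, and `b"IJKL"`, then `rebuild(stripe, 1)` returns `b"EFGH"` without ever reading the second block. The same call works for any single position, including the parity block itself, but losing two blocks from the same stripe cannot be repaired this way.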

The chapter mentions that backups for database systems are more complex, partly due to needing to take them down for the backup, and partly due to needing to use a specialized backup program that maintains the relations between data elements. For a database that is in constant use, a lock and copy method is not practical. It is better to pursue a live copy method, such as the continuous database protection scheme mentioned on page 101.

Backup plans should be part of normal operations, but recovery plans are typically part of contingency planning. Since recovery cannot be done if backups were never done, the text includes both concepts in this chapter.

[Image: leased data line concept]

Another new concept is on page 103. Electronic vaulting is off-site storage of large volumes of your data. It may use old fashioned leased or dedicated data lines. In the accompanying images, users at multiple locations are using data lines whose use has been purchased from a telephone company. This was a common method before people began to rely on the Internet for transmission of public and private data.

Note that the general concept of passing data in this image is not really different from using Internet connections, except that there is a guarantee of quality of service and bandwidth available on a leased line. The text remarks that electronic vaulting will probably be slower than local solutions due to the WAN links involved in it. Leased lines with sufficient bandwidth can overcome this problem.

A less detailed but more robust solution is Remote Journaling, discussed on page 105. This solution is a transaction recording and copying system, so it tracks transactions on your system, but it does not copy your entire database. Restoring a lost state from the journal also requires a reference copy of the databases and systems whose transactions were saved in the journal.
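
The relationship between the journal and the reference copy can be sketched in a few lines. This is an illustration only: the database is modeled as a dictionary, the journal as a list of `(operation, key, value)` tuples, and the function name `replay` and the account names are made up for these notes.

```python
import copy


def replay(reference_copy, journal):
    """Return a new state: the reference copy with the journaled transactions applied in order."""
    state = copy.deepcopy(reference_copy)  # leave the reference copy untouched
    for operation, key, value in journal:
        if operation == "set":
            state[key] = value             # insert or update a record
        elif operation == "delete":
            state.pop(key, None)           # remove a record if it exists
        else:
            raise ValueError(f"unknown journal operation: {operation}")
    return state
```

Given a reference copy of `{"acct-1": 100, "acct-2": 50}` and a journal that sets `acct-1` to 80, sets `acct-3` to 20, and deletes `acct-2`, the replay produces `{"acct-1": 80, "acct-3": 20}`. Without the reference copy, the journal alone could not tell us what `acct-1` held before the first transaction.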

A more complete solution is discussed on page 106. Database (Databank) Shadowing records transactions, and it also keeps a copy of the relevant data, so it is more like a live copy of your operational system.

The text briefly discusses NAS and SAN, two technologies that provide more storage solutions to network users. Network Attached Storage is typically provided by a device that is added to the network, or by a dedicated server that provides storage space to users. Standard network protocols are used to read and write to the new storage space, and access can be granted by normal means. There is typically some latency in these systems.

Storage Area Network technology requires direct connection to a dedicated storage network, typically through wide bandwidth connections. Only users connected to the SAN can use it.

The author closes this section of the chapter with a discussion of virtual servers. A virtual machine is like a program that runs on an actual machine. The virtual machine can run any operating system that the hardware of the actual (host) machine can support. The attraction to doing this is that you can run several virtual machines on one well equipped host machine, and if any of them goes down, it can be brought back up very quickly without having to worry about the damage to the OS that might have happened on a dedicated server. Virtual machines typically run in a memory management environment provided by one of the three products listed on page 109:

  • Microsoft's Virtual Server
  • VMware's VMware Server
  • Oracle VM VirtualBox

Assignments

The assignment for this week is to do Exercise 3-1 at the end of chapter 3.