NET 121b: Essentials of Networking

Chapter 20: Fault Tolerance and Disaster Recovery


This chapter discusses methods used to provide specific kinds of backup services. The topics of this chapter are:

  1. Fault Tolerance
  2. Active Directory and NDS replication
  3. System backups and UPS usage

Fault Tolerance

The chapter begins with a discussion of the concept of fault tolerance. A fault can be the loss of a device or component, or a disaster that shuts down your operation. Making your network fault tolerant means taking steps that allow you to maintain service when small problems crop up, and to restore service quickly when large problems occur.

The chapter then turns to RAID. The acronym is expanded differently by different sources. Redundant Array of Independent Disks (the version your text uses) and Redundant Array of Inexpensive Disks are both acceptable, and both wrong in different ways. The disk drives may not be inexpensive, and some RAID configurations do not allow you to think of the drives as independent. What both mean is several hard drives that are linked together for improved performance, improved fault tolerance, or both. The improvement varies with each design.

RAID levels and features:

  • RAID 0, Disk striping - writes to multiple disks and does not provide fault tolerance. Performance is increased, because each successive block of data in a stream is written to the next device in the array. Failure of one device will affect all data, so striping alone does not improve fault tolerance: it decreases it.
  • RAID 1, Mirroring and Duplexing - provides fault tolerance by writing the same data to two drives. Two mirrored drives use the same controller. Two duplexed drives each have their own controller. Data is written to the second drive as a live copy of the first. This leads to fault tolerance, but somewhat decreased write performance, since every write goes to two drives.
  • RAID 4, Disk striping with large blocks - data is written in large blocks to the data drives; parity data for each row of corresponding blocks across the drives (a stripe) is stored on a dedicated parity drive. For example, assume that you have five drives, A through E, each divided into stripes 1 through 4. Data would be written to drives A through D. Parity data for stripe 1 is written to stripe 1 on drive E, parity data for stripe 2 is written to stripe 2 on drive E, etc.
  • RAID 5, Disk striping with parity - data is striped across all disks in the array except the drive that stores parity for that stripe. Using the same five drives, data could be written to stripe 1 of drives A through D, and parity data for it saved on stripe 1 of drive E. For stripe 2, we change the locations: data is written to drives A, B, C and E, and parity data is written to stripe 2 of drive D. This moves the parity data for separate stripes to different drives.
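
The parity idea behind RAID 4 and RAID 5 can be sketched in a few lines of Python. This is only an illustration, assuming parity is a byte-by-byte XOR of the data blocks in one row across the drives; the block contents and the function names `xor_blocks` and `parity_drive` are made up for the example, and a real controller does this in hardware or in the driver.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity_drive(row, drive_count):
    """Index of the drive holding parity for a row (0-based), rotating
    backwards from the last drive as in the RAID 5 example above.
    The rotation order is an assumption; real arrays use several layouts."""
    return (drive_count - 1 - row) % drive_count

# One row across five drives: four data blocks (A-D) plus parity (E).
data = [b"NET1", b"21b ", b"RAID", b"demo"]
parity = xor_blocks(data)

# Simulate losing drive C (index 2) and rebuild its block
# from the three surviving data blocks plus the parity block.
survivors = [block for i, block in enumerate(data) if i != 2]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[2])    # True: the lost block is recovered

# RAID 4 would always use drive index 4 (E); RAID 5 rotates.
print([parity_drive(row, 5) for row in range(4)])    # [4, 3, 2, 1]
```

Because XOR undoes itself, any single missing block in a row can be rebuilt from the others. Losing two drives at once leaves too little information, which is why these levels tolerate only one drive failure.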

The text discusses mirroring and duplexing in more detail, listing some pros and cons for each.

Mirroring

  Pro:
  • Performance is better than striping with parity.
  • Splitting up the drives causes no loss of data.
  • Provides tolerance for read errors and drive failures.

  Con:
  • You write the data twice, so you spend twice as much on drives, compared to not doing it.
  • If the controller fails, both drives are out of service.

Duplexing

  Pro:
  • Performance is better than striping with parity.
  • Splitting up the drives causes no loss of data.
  • If a controller fails, the other drive should not be affected.
  • Provides tolerance for read errors, drive failures, and controller failures.

  Con:
  • You write the data twice, so you spend twice as much on drives, compared to not doing it.

Compared to the two scenarios above, striping with parity loses less storage space. Duplexing and mirroring sacrifice one drive for each drive in use. Striping with parity uses parity information for recovery, not a full copy of the data, so you "lose" the capacity of one drive for each array, not one for one.
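
The storage trade-off can be worked out with simple arithmetic. The sketch below assumes all drives in an array are the same size; the function name `usable_drives` and the 100 GB drive size are illustrative, not from the text.

```python
def usable_drives(total_drives, scheme):
    """Number of drives' worth of space left for data.

    scheme "mirror" covers mirroring and duplexing (one copy per drive);
    scheme "parity" covers striping with parity (one drive's worth per array).
    """
    if scheme == "mirror":
        if total_drives < 2 or total_drives % 2:
            raise ValueError("mirroring needs drives in pairs")
        return total_drives // 2
    if scheme == "parity":
        if total_drives < 3:
            raise ValueError("striping with parity needs at least 3 drives")
        return total_drives - 1
    raise ValueError("unknown scheme: " + scheme)

drive_gb = 100  # assumed size of each drive
for scheme in ("mirror", "parity"):
    print(scheme, usable_drives(6, scheme) * drive_gb, "GB usable of", 6 * drive_gb)
# mirror 300 GB usable of 600
# parity 500 GB usable of 600
```

With six drives, mirroring gives up half the space, while striping with parity gives up one sixth. The gap widens as drives are added to the array.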

Active Directory and NDS replication

A partition is defined as a subsection of a Directory, whether it is an eDirectory or an Active Directory. In an Active Directory network, every domain controller in a given domain holds a copy of the same information as every other domain controller. When a change is made on one domain controller, it replicates that change to the other domain controllers in its domain.

In a Novell network, we divide the Directory Tree into partitions by using containers as markers of where partitions start. A partition must have a container at the top, and may contain other containers as well. The partition is referred to by the name of the highest container in it. In the diagram below, the first partition we see is the [Root] partition. It is drawn so that it contains the [Root], the highest container object in the Tree. (This is not the only way to partition a Tree, just an example. The [Root] object need not be in a partition by itself.)

Any partition also contains all objects inside the containers it contains, unless another partition is made as a child of the first one. For example, the [Root] partition is the default partition in an NDS Tree. If it had been left alone, the [Root] partition below would have contained all objects in the Tree. However, a child partition was created: the EMA partition. We refer to the EMA partition as the child of the [Root] partition, since the EMA partition branched from the [Root] partition. This also makes the [Root] partition the parent partition of the EMA partition. In this example, the EMA partition also has two child partitions: the NYC partition and the Tokyo partition.

The topmost container in a partition is called the partition root. This is true for any partition. In the diagram above, we see that there are four partitions. The first is the [Root] partition, which contains [Root] only. The other three partitions are named for their partition root objects, the three containers at the top of each one. Note the naming standard:

  • All partitions are named for the topmost container in them, the partition root.
  • Only one partition may be called the [Root] partition, the one with [Root] at its highest point.

It may be clearer if you think of the phrase "partition root" as really meaning the "partition's root", the place where we drew a line in the Tree and said it begins a new partition.

Now for the concept of Replicas. A Directory partition contains a lot of information, and it would be a shame to lose it, so Novell invented four kinds of replicas, most of which are copies of a given partition.

  • Master replica - a complete copy of a partition. The original copy is a master copy. Another copy may be promoted to master status, if the original master is damaged. You may only have one master replica of any partition at any given time. Changes made elsewhere are passed here during NDS synchronization, and reconciled. This replica can be used to make changes to the partition and to objects in it.
  • Read/Write replica - also a complete copy of the partition. There can be several of these copies, and two are created by default (if you have enough servers). This replica can be used to make changes to objects in it. Any partition changes attempted in it are passed to the master replica as requests. Object changes made here are passed to the master replica during NDS synchronization, and reconciled there. This replica can become a master replica if the master is damaged. It can also become a Read-only replica.
  • Read-only replica - also a complete copy of the partition. Multiples are allowed. Changes requested here are not made here, but passed to the nearest Master or Read/Write replica. Login to a Read-only replica is possible, but not directly supported, as the request is passed on as above. This replica may become a Master replica or a Read/Write replica. This replica receives changes made in other replicas during synchronization. Novell does not recommend making this kind of replica.
  • Subordinate Reference - this is not a complete copy. It is not really a copy at all, just a pointer to a copy of one of the above types. Subordinate References are created automatically. Any server that has a replica of a parent partition, and no replica of the parent's child, will be given a Subordinate Reference to the child. This replica does not support any changes, but forwards any request to a Master or Read/Write replica.
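
The rule for Subordinate References can be expressed as a short sketch. It uses the partition names from the example in these notes ([Root], EMA, NYC, Tokyo); the function name and the sets of replicas passed to it are hypothetical, and this is a model of the rule only, not of how NDS stores anything.

```python
# Child partition -> parent partition, following the example in these notes.
PARENT = {"EMA": "[Root]", "NYC": "EMA", "Tokyo": "EMA"}

def subordinate_references(replicas_on_server):
    """Partitions for which a server would be given a Subordinate Reference.

    A server gets one for every child partition whose parent it holds
    a replica of, when it holds no replica of the child itself.
    """
    held = set(replicas_on_server)
    return sorted(
        child for child, parent in PARENT.items()
        if parent in held and child not in held
    )

print(subordinate_references({"[Root]"}))          # ['EMA']
print(subordinate_references({"[Root]", "EMA"}))   # ['NYC', 'Tokyo']
print(subordinate_references({"NYC"}))             # []
```

Note that a server holding only [Root] gets a reference to EMA but not to NYC or Tokyo: a Subordinate Reference is a pointer, not a replica, so it does not trigger further references to the child's own children.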

Changes to data come in two types: simple changes and complex changes. A simple change affects the data in one object. This change needs to be replicated to all copies of the NDS partition that the object exists in. A simple change is replicated easily, and the replicas synchronize quickly. A complex change affects multiple objects, such as creating a new partition from two smaller ones. Much data has to be replicated, so synchronization may take much longer.

Three utilities are used for maintenance of NDS:

  • NDS Manager - use this from the Tools menu of NetWare Administrator
  • DSREPAIR - an NLM to be run on the server
  • SET NDS TRACE - this is a command run on the server

System backups and UPS usage

Making backup copies of data can be done in several ways. The RAID 1 concept of mirroring a drive is one example. Another is making backups on tape, CDs, or other media.

  • tape drives - internal or external, intended to make copies of hard drives
  • removable hard drives - can be used as primary hard drive or to make a copy of the primary
  • jump drives - typically small compared to hard drives, often ranging from 16 MB to 1 GB, used more for convenience to transport files
  • Zip drives - typically 100 MB or 250 MB capacity, displaced by jump drives due to their ease of use (no special hardware required)
  • RAID systems - follow the link above for a nice summary of RAID level features not listed in these notes.

Four backup strategies, or schedules, are often encountered. You should know them. First some terms:

  • Full - a backup of all files in the target; sets the archive bit of each file to OFF
  • Incremental - a backup of target files that are new or changed since the last backup; depends on the fact that programs that change files typically set the archive bit to ON when a change is made; sets archive bit to OFF for all files it copies
  • Differential - a backup of all files new or changed since the last Full backup; copies all files whose archive bit is set to ON; does not change the archive bit of files it copies
  • Copy - like a Full backup, but does not change the archive bit of files it copies. This is typically not part of a standard backup strategy, but an option to work around the system.
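
The way each backup type treats the archive bit can be simulated with a small sketch. The file names and the `run_backup` function are invented for illustration; the dictionary maps each file to its archive bit (True means ON).

```python
def run_backup(kind, archive_bits):
    """Return the sorted names of files copied, updating archive bits in place."""
    if kind in ("full", "copy"):
        copied = sorted(archive_bits)                      # everything in the target
    elif kind in ("incremental", "differential"):
        copied = sorted(n for n, bit in archive_bits.items() if bit)
    else:
        raise ValueError("unknown backup type: " + kind)
    if kind in ("full", "incremental"):                    # only these reset the bit
        for name in copied:
            archive_bits[name] = False
    return copied

files = {"a.txt": True, "b.txt": True, "c.txt": True}
print(run_backup("full", files))          # ['a.txt', 'b.txt', 'c.txt']

files["a.txt"] = True                     # a program changes a.txt
print(run_backup("differential", files))  # ['a.txt']

files["b.txt"] = True                     # later, b.txt changes too
print(run_backup("differential", files))  # ['a.txt', 'b.txt']  (a.txt copied again)
print(run_backup("incremental", files))   # ['a.txt', 'b.txt']  (and bits reset)
print(run_backup("incremental", files))   # []  (nothing changed since)
```

The second Differential run copies a.txt again because the first one left its archive bit ON. That is why Differential tapes grow through the cycle while Incremental tapes do not.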

This needs more explanation. Assume we use a tape drive to make backups. In a Full backup strategy, the entire target is backed up to tape every time we make a backup tape. This strategy consumes the most time and the most tapes to carry out a backup. To restore, we simply restore the most recent tape(s). This is the least time consuming strategy for restoring, but the most time consuming for creating backups.

The second method, Incremental backup, means that we start with a Full backup of the target, and then each successive backup tape we create only backs up the elements that are new or changed since the last backup was created. This means that successive backups will not always be the same length. Because each tape holds only what changed since the previous one, this is the least time consuming backup, but the most time consuming restore. To restore, we must first restore the last Full backup made, and then restore EVERY tape made since then, in order, to ensure getting all changes.

The third strategy, Differential backup, also starts with a Full backup tape. Then each successive tape made will contain all the files changed since the last Full backup was made. This means that we will have to restore only one or two tapes in a restore operation. If the last tape made was a Full tape, we restore only that one. If the last tape made was a Differential tape, we restore the last Full tape, then the last Differential tape.

In both Incremental and Differential backup strategies, you will typically use a rotation schedule. For example, you could have a one week cycle. Once a week, you make a Full backup, then every day after that you make the other kind you have chosen to use: Incremental or Differential.
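
The restore rules for the three strategies can be summed up in a short sketch. It assumes a cycle uses Full backups plus only one other kind, as described above; the function name `tapes_to_restore` and the list-of-strings history are illustrative choices.

```python
def tapes_to_restore(history):
    """Given backup kinds in the order the tapes were made (oldest first),
    return the positions of the tapes to restore, in restore order."""
    fulls = [i for i, kind in enumerate(history) if kind == "full"]
    if not fulls:
        raise ValueError("no Full backup to start from")
    last_full = fulls[-1]
    later = list(range(last_full + 1, len(history)))
    kinds = {history[i] for i in later}
    if kinds <= {"incremental"}:
        return [last_full] + later        # every tape since the last Full
    if kinds <= {"differential"}:
        return [last_full] + later[-1:]   # only the newest Differential
    raise ValueError("mixed schedules are outside this sketch")

# A Full backup followed by three daily backups:
print(tapes_to_restore(["full", "incremental", "incremental", "incremental"]))
# [0, 1, 2, 3]
print(tapes_to_restore(["full", "differential", "differential", "differential"]))
# [0, 3]
```

Three days into the cycle, the Incremental schedule needs four tapes to restore and the Differential schedule needs two.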

To keep them straight in your mind, remember that:

  • a Full backup copies everything. Resets all archive bits.
  • an Incremental backup copies everything different from the last backup. Resets the archive bits of files it copies.
  • a Differential copies everything "different from Full". (Different from the last Full backup.) Does not reset any archive bits.
  • a Copy makes a Full backup, and does not reset any archive bits.

The time required to create backup tapes should be considered along with the time to restore a backup. When you consider the two as two sides of the answer to one question (What method should I use?), the answer may be the most common choice: Differential. It is the best compromise in terms of backup time versus restore time. Note also that all three standard methods require a Full backup on a regular cycle. The recommendation is usually to run a Full backup weekly.

Whichever backup strategy you use, you should consider keeping one set of backups in a secure location at your site (handy and protected) and another set in a secure location at a distant site. Consider the potential disasters that could occur at your location (fire, flood, tornado, hurricane, vandalism, etc.) and decide how to protect your backups and how far away the other sets should be.

A true disaster recovery plan will include access to a site to use if your data center becomes unusable, unavailable, or nonexistent. The text describes such alternate sites as falling into three categories:

  • cold - a location without hardware, software, or data. You will have to bring in staff, equipment, software, and your off site backups.
  • warm - a location that has equipment to begin the job of cloning your data center. It may need additional equipment. It will definitely need your off site backup sets and software before staff can begin using it as your alternate data center. Requires less start up time than a cold site.
  • hot - a site that matches your hardware and software. It needs your off site backup sets to be restored, and then it is ready to go. This is the most expensive option.

The text suggests some hardware that is useful in maintaining system up time and data reliability. An Uninterruptible Power Supply (UPS) is typically recommended for all servers. Various models of UPS are available with differing capabilities. The main differences between them are how many minutes of battery power they can supply to your system in case of a power failure, and how many watts they can deliver for that period of time.
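
Those two ratings can be combined into a rough estimate when comparing models. The sketch below is a simplification under stated assumptions: it treats battery capacity as a fixed number of watt-hours and uses an assumed inverter efficiency of 0.9. Real batteries deliver less at high loads and as they age, so the manufacturer's runtime chart should be trusted over this arithmetic. All names and figures are illustrative.

```python
def can_carry_load(ups_rated_watts, load_watts):
    """True if the UPS is rated to deliver at least the connected load."""
    return load_watts <= ups_rated_watts

def estimated_runtime_minutes(battery_wh, load_watts, efficiency=0.9):
    """Rough minutes of battery power: usable energy divided by the load."""
    if load_watts <= 0:
        raise ValueError("load must be greater than zero")
    return battery_wh * efficiency / load_watts * 60

# Hypothetical server drawing 300 W on a 500 W UPS with a 200 Wh battery.
load = 300
if can_carry_load(500, load):
    print(round(estimated_runtime_minutes(200, load)), "minutes, roughly")
```

The useful question is whether the estimate leaves enough time to shut the server down cleanly, not whether it can ride out a long outage.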