Chapter 9

ITS 4350 - Disaster Recovery

Chapter 9, Disaster Recovery: Preparation and Implementation

Objectives:

This lesson is about chapter 9. Objectives important to this lesson:

Disaster recovery
Maintenance
Investigations

Concepts:

The major concepts in this chapter were covered, in large part, in the notes for last week. Let's start with some background that tries to put recovery operations in perspective.

Here are some alarming statistics from a similar text:

90% of companies having a data center disruption lasting 10 days or more went into bankruptcy
40% of companies that have disasters shut down and never reopen
30% of companies that have disasters fail within two years

The three statistics above should tell you to have a plan in place that will not let the disaster close your company.

Disasters can be classified based on some of their characteristics. Here are two points of view:

natural disaster or man-made disaster - Is the disaster caused by nature, or by an act of a human being? Consider the list in Table 9-1. Most of the disasters listed are clearly natural (e.g., earthquake, tornado, tsunami), while others require more detail to determine their class. Can a fire be caused by a person? How about a flood? How about other kinds of mayhem?

Mayhem comes in both classifications: natural and man-made. The question is what can we do to avoid or minimize the mayhem?
Another point of view is time-based. Does the disaster have a rapid onset or a slow onset? Is it more like a storm or more like global warming? The answer can lead to different types of responses, and should suggest different sorts of causes.

What should be crossing your mind now is why the category you place a disaster in should matter. Do the plans go in the red books or the blue books depending on the classification? If that is how your organization works, go for it.

The text often offers the thrilling advice to back up your data. Yawn, and sip some more caffeine; it's a necessary part of business. There are a lot of solutions from a lot of vendors. Industry professionals can give you plenty of reasons to do it better than you are. Organizations that did not do it well found that the results were not pretty.

The text changes topics to explain why backups of data are needed. Can't we just save everything in the cloud? "The cloud" is just somebody else's server, and it may be exposed to the same threats as our own equipment. Even the natural disasters that might happen to us should put fear into the best of us.

Natural disasters can disrupt any business, including data centers and shrimp boat businesses.

The text continues with some remarks that relate to business continuity, which means that we keep running the business even though we have had a problem, a setback, or a disaster. We should consider the author's remarks about the effects of a business disruption that might be reduced by detailed planning and action based on such planning.

Some definitions from the text may be helpful in understanding the point of the chapter.

business continuity - continuation of operations and services despite a disruptive event
continuity of operation - same as business continuity
risk assessment - analyzing risks, their effects, and what we can do to reduce both the probability of their occurrence and the severity of their effects
planning and testing - identification of risks and threats, creating plans to deal with them, and conducting tests of those plans
business impact analysis - identifying mission critical business functions to prioritize our continuity plans
IT contingency plan - a plan to continue to provide services in case a particular incident disrupts normal service; a separate plan will be needed for each type of incident that can occur
disaster recovery plan - restoring services provided by the enterprise to their standard state; this is not just about IT services

As we saw in the chapter about the CSIRT, the author thinks we should form a team for Disaster Recovery. I wonder if he has considered the possibility that it is the same team. In a small company, you should expect that this is so. You should form a group to handle the concerns of each item in the BIA, since those are the elements that you decided were important at the beginning of the planning process. If you need a separate group for each BIA element, you probably have a very complex organization. If one team can do the whole thing, good for them. You are better served if all the teams (however many there are) cooperate and communicate their intentions to each other. They are all building (or rebuilding) the same company. Management oversight should be quite thorough because the end result of this process is our new, improved organization.

You may want to look over NIST SP 800-34, Revision 1. This may be the first time we have been referred to this document, but it seems very much like the one that was used as a plan for the CSIRT and its duties. Get support to make the plan, consult the BIA, propose and add controls to reduce the need for the plan, plan to end and repair the disaster, write out the plan, test the plan, then review and maintain the plan. Familiar enough?

Forensics

The last chapter did not say enough about forensics, so let's have a bit more from another text. A forensic investigation is typically one that concerns a crime. This section is about computer forensics, investigations into crimes that involve computers and other information system equipment. From one point of view, there are four aspects of an investigation:

secure the scene - The team that does this may be called an Incident Response Team or a Forensics Response Team, or another title that means the same thing. They are responsible for taking possession of devices that might hold any data that might contain evidence of the crime being investigated. In addition, they should photograph the scene, document their observations, and record interviews with witnesses. (Pick your favorite police procedural and study it a bit.)
preserve and collect the evidence - This aspect is closely related to the first, in that the response team may have to take images of data in RAM that would be lost if not recorded before the power is turned off. Note this Order of Volatility, which indicates the order in which to capture data from a running system:
- Register, cache, and peripheral memory first
- Random Access Memory (RAM) second
- Network state third
- Running processes fourth
establish (and maintain) the chain of custody - There must be continuous documentation of who has had access to seized devices and data, who has done what with them, and who they are turned over to at each change in custody.
examine for legal evidence - Although other discussions have used the word "evidence" several times, this one brings up the point that not everything you find is actually legal evidence. Only things that indicate or prove a crime was or was not committed can be considered as evidence that will be presented in court.
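
Two of the ideas above lend themselves to a small sketch: sorting data sources by the Order of Volatility, and checking that a chain of custody has no gaps. This is only an illustration, not a forensic tool. The source names, the CustodyTransfer record, and its field names are hypothetical; the ranks simply copy the four-item list given above.

```python
from dataclasses import dataclass

# Ranks copy the Order of Volatility listed in the notes:
# a lower rank means "capture this sooner". The key names are hypothetical.
VOLATILITY_RANK = {
    "registers_cache_peripheral_memory": 1,
    "ram": 2,
    "network_state": 3,
    "running_processes": 4,
}


def capture_order(sources):
    """Return the given data sources sorted most-volatile first."""
    return sorted(sources, key=lambda name: VOLATILITY_RANK[name])


@dataclass(frozen=True)
class CustodyTransfer:
    """One hand-off of a seized item (hypothetical record layout)."""
    item: str
    released_by: str
    received_by: str
    action: str


def custody_is_continuous(transfers):
    """True if every hand-off is released by whoever received the item last."""
    return all(
        earlier.received_by == later.released_by
        for earlier, later in zip(transfers, transfers[1:])
    )


if __name__ == "__main__":
    print(capture_order(["running_processes", "ram", "network_state"]))
    log = [
        CustodyTransfer("laptop-01", "responder", "evidence clerk", "seized and bagged"),
        CustodyTransfer("laptop-01", "evidence clerk", "examiner", "imaged drive"),
    ]
    print(custody_is_continuous(log))
```

A real custody log would also record dates, times, and signatures; the check here only shows the idea that each person who releases an item must be the person who last received it.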

Several texts elaborate on memory and storage locations that should be examined for meaningful data. What you can expect to find there may surprise you:

Windows page file - This is a hidden file, typically on the boot drive, that Windows uses to store "memory pages" that it thinks you are not using presently, like memory devoted to an application that is minimized. The file is probably in the root of the drive, and is probably called pagefile.sys. You should expect to see pieces of any file that the computer worked on, especially if it was minimized while the user worked on something else.
RAM slack - This will take a minute. When Windows saves files, it saves to sectors (track sectors) on a drive. Sectors are logically arranged in clusters, which are the smallest storage area a file system can use. The number of sectors in a cluster varies depending on the way a drive was formatted. When a file is saved, it will take a certain number of clusters to hold it, but the file itself may not actually fill the last sector used in the last cluster used to store it. When this happens, older versions of Windows (before NT) did something you may never have heard about. They filled the last sector used for the file with data pulled randomly from RAM. This data is called RAM slack, a copy of a piece of RAM that has been stored in the slack space at the end of a sector. Why did it do this? Windows just worked that way: it had to fill the rest of the sector. You never knew what you'd find in it. Since NT, the RAM slack space has been filled with zeros, so this is less of a problem as time goes by, except for stand-alone, legacy systems that use older versions of Windows.

In the simplified illustration below, a file has been saved to two four-sector clusters, but it only fills six and a half sectors. The first cluster, marked in cyan, is full: all four sectors have been used by the file. The second cluster is not full. The file fills its first two sectors and the first half of its third sector, which is the seventh sector overall. The second half of that seventh sector (item F) is filled with RAM slack. The eighth sector has not been used at all, but we will cover that in a minute.

Note: for many years, a track sector has held a standard 512 bytes regardless of what device it was on. As of January 2011, this was no longer true. A device using Advanced Format on a system that understands it may use sectors that hold 4096 (4K) bytes. A larger sector leaves much more room for RAM slack. If a computer using such a device is running an older OS, the 512-byte limit for sectors still applies.
Drive slack - So, if the cluster holds a specific number of sectors, what if the file only used some of those sectors when it was saved? Does Windows fill the rest of those sectors, too? No, but something else interesting happens. If anything was ever written to those sectors before, it remains there undisturbed until there is a need to write to them. This means that some sectors at the end of a cluster may hold old data that the user thought was deleted. The data held in those sectors is called drive slack. You never know what might be in it.

In the illustration below, the last sector of the second cluster (item G) is Drive slack. The new file has not overwritten whatever was in that sector already.
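
The arithmetic behind RAM slack and drive slack can be worked out in a few lines. The sketch below is a simplified model that uses only the definitions in these notes. The function name slack_bytes and its parameters are made up for illustration, and it ignores file system details such as very small files stored inside directory records.

```python
def slack_bytes(file_size, sector_size=512, sectors_per_cluster=4):
    """Return (ram_slack, drive_slack) in bytes for a file of file_size bytes.

    ram_slack   = unused bytes in the last sector the file touches
    drive_slack = bytes in the untouched sectors of the last cluster
    """
    if file_size <= 0:
        raise ValueError("file_size must be a positive number of bytes")
    cluster_size = sector_size * sectors_per_cluster
    # Ceiling division: a partly used sector or cluster still counts as used.
    sectors_used = -(-file_size // sector_size)
    clusters_used = -(-file_size // cluster_size)
    ram_slack = sectors_used * sector_size - file_size
    drive_slack = clusters_used * cluster_size - sectors_used * sector_size
    return ram_slack, drive_slack


if __name__ == "__main__":
    # The example from the notes: six and a half 512-byte sectors = 3328 bytes.
    print(slack_bytes(3328))                      # (256, 512)
    # The same six and a half sectors on a 4096-byte Advanced Format drive.
    print(slack_bytes(26624, sector_size=4096))   # (2048, 4096)
```

For the file in the illustration, half of the seventh sector (256 bytes) is RAM slack and the whole eighth sector (512 bytes) is drive slack. With 4096-byte sectors the same layout leaves 2048 bytes of RAM slack, which is the "more room" mentioned in the note about Advanced Format.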