ITS 4350 - Disaster Recovery


Chapter 2, Planning for Organizational Readiness

Objectives:

This lesson presents the first chapter about contingency planning. Objectives important to this lesson:

  1. Support of management
  2. Forming a planning committee
  3. Business impact analysis
  4. Collecting data for a BIA
  5. Budgeting contingency operations
Concepts:
Chapter 2

If you browse the scenario that opens chapter 2, you will see several people role-playing a contingency. The exercise has two parts. One is to run through a plan that is already in their operations manual. The other is to involve the people in the room, to develop questions about the planned response, and to determine whether they are doing what should be done. In this case, they have a plan for a contingency, they are walking through the plan, and they are learning about its merits. They should not just be reading their assigned parts. They should be thinking about the reality of doing what they are assigned and proposing amendments to improve the plan.

The text talks about forming a body that will be responsible for creating contingency plans. That body has a number of duties, some of which should be done before it starts:

  • Obtaining commitment from senior management - Senior management must commit to empowering the committee to form, to do its job, to examine its output, to revise that output as needed, and to require all employees and departments to accept and follow the plans. Management must also allow for improvisation as needed; things never go exactly as planned.
  • Managing the contingency planning process - Assign members and other staff to gather information and assemble procedures.
  • Writing the master document - There needs to be a place to start, and once it starts, there may be area/division specific procedures that will be written by subject matter specialists.
  • Conducting the business impact analysis - Documenting threats, vulnerabilities, and attacks, and relating them to documented business functions.
  • Developing teams - Creating teams to write manuals for incident response, business continuity, disaster recovery, and crisis management.

The text offers suggestions about the kinds of knowledge that will be necessary to create the contingency plans. Note the section about representatives from other business units, which include the company's actual business, IT management, and IT security management. Remember that these are the three communities of interest involved in information security planning. (They are not the three axes of the CNSS security model, which are security goals, information states, and safeguards.) On page 52, the text tells us that there also needs to be commitment from management in these areas to create useful plans that will be available when they are needed.

In the scenario at the beginning of the chapter, we are left wondering whether the manuals being used by people from different work areas were identical, or different in some ways. There should be specific pages for some staff, depending mostly on the specialization or security level of their jobs. However, there should also be a master copy with all the sections in case someone needs to step in to do another person's job.

The text goes into a great deal of detail for the next several pages. Moving on to more focused material, let's continue on page 57, where we see five "Keys to BIA Success":

  1. Set the scope to cover the necessary work areas of the organization for each risk to be addressed.
  2. Get information from experts in your organization about the impact an exploit will have.
  3. Keep the information factual where possible, to avoid opinions that may be mistaken.
  4. Determine the areas you need to report on before gathering the data. This will save time, and be of more interest to approvers.
  5. Get approval of your BIA and your risk assessment. Without it, your process will stop there.

The first of three phases of a business impact analysis begins on page 58 with making a list of the critical processes/functions of the organization you are studying. In the first column of the example chart on page 59, you see seven functions performed by the company. (The text notes that this chart of seven functions is an example, and that a real chart for a real company would be much longer.)

Function                Profitability  Strategic Value  Internal Ops  Public Image  Total Weighted Score
                        (40%)          (30%)            (20%)         (10%)         (100%)
New business            8              8                3             6             6.8
Maintain old business   8              7                6             7             7.2
ISP service             10             8                4             8             8.0
Internet services       9              10               4             8             8.2
Help desk               5              6                6             8             5.8
Advertising services    6              9                4             9             6.8
Public relations        4              6                2             10            4.8

In the columns to the right of each function name, four kinds of impact are considered. Each kind of impact has been given a percentage rating, reflecting how much the company cares about it. Note that the sum of the percentages is 100%. If it were not, we would have to assume we are not measuring the impacts correctly, or we are measuring the wrong impacts.

Each function is rated on a scale, probably from 1 to 10, on how much its loss would affect each kind of impact. Note that the columns and the rows do not add up to a specific value. There is no presumption that they should. To get the weighted score for a function, its raw score in each column is multiplied by the percentage for that column, and the four products are added together. For example, New business scores 8(0.40) + 8(0.30) + 3(0.20) + 6(0.10) = 6.8. The weighted score for each function is a measure of its criticality to the company's ongoing business.

In the chart above, the three functions with the highest weighted scores are Internet services (8.2), ISP service (8.0), and Maintain old business (7.2). They are the ones we need to protect the most. Preparing a chart of this sort, or a series of them, tells us which business functions should get our attention first and foremost.
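The weighted-score arithmetic can be sketched in a few lines of Python. This is only an illustration using the example values from the chart; the names WEIGHTS, SCORES, weighted_score, and rank_functions are made up for this sketch and do not come from the text.

```python
# Column weights from the example chart; they must add up to 1.0 (100%).
WEIGHTS = {
    "profitability": 0.40,
    "strategic_value": 0.30,
    "internal_ops": 0.20,
    "public_image": 0.10,
}

# Raw impact scores for each function, in the same column order as WEIGHTS.
SCORES = {
    "New business": (8, 8, 3, 6),
    "Maintain old business": (8, 7, 6, 7),
    "ISP service": (10, 8, 4, 8),
    "Internet services": (9, 10, 4, 8),
    "Help desk": (5, 6, 6, 8),
    "Advertising services": (6, 9, 4, 9),
    "Public relations": (4, 6, 2, 10),
}


def weighted_score(raw_scores, weights=WEIGHTS):
    """Multiply each raw score by its column weight and add the products."""
    total = sum(score * weight for score, weight in zip(raw_scores, weights.values()))
    return round(total, 2)


def rank_functions(scores=SCORES):
    """Return (function, weighted score) pairs, most critical first."""
    ranked = [(name, weighted_score(raw)) for name, raw in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Sanity check: the weights should cover exactly 100%.
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    for name, score in rank_functions():
        print(f"{name:<22} {score:.1f}")
```

Sorting by weighted score puts the functions in the order we would want to protect them. Changing the weights shows how much the ranking depends on what the company says it cares about.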

Page 60 brings up a related idea. Some functions will have a different criticality when we consider which to bring back first from a damaged or disabled state. Some functions rely on other functions to make them possible. In cases like this, it is necessary to prioritize the recovery of any service that a critical service depends on. With that in mind, it makes sense to consider the downtime metrics discussed in the text. (This is considered further in the table on page 63.)

  • Maximum Tolerable Downtime (MTD) - In the text, we see the example of a system that can be down for no more than 4 hours in a month. A more specific tolerance would state that the four hours must be scheduled as one hour each week, on a day or shift during which the system is not needed. That standard is different from four hours all at once, and different again from four hours spread evenly across a month. This measure of time refers to normal operations.
  • Recovery Time Objective (RTO) - This time is similar to the measure above, but it refers to the time a system can be allowed to be down during a recovery. It assumes that a disaster has occurred, that emergency procedures are in effect, and that we intend to restore this system to normal operation. Shorter times are assigned to this measure for systems that are critical to our operations. The more critical a system is, the less time we can do without it.
  • Recovery Point Objective (RPO) - This one is harder. It is a measure of how current the most recent backup is, and how much data we can expect to have to add to that backup to become current again. The text expresses it as a number of work hours that will need to be captured and added to the most recent backup once it is restored. In the discussion on page 61, the text gives us another way to look at it: it calls the RPO the amount of data we can do without during the recovery process.
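To make the recovery metrics concrete, here is a minimal sketch that compares a single outage against an RTO and an RPO. The function name recovery_report, the timestamps, and the four-hour and eight-hour objectives are all invented for illustration; they are not taken from the text.

```python
from datetime import datetime, timedelta


def recovery_report(last_backup, outage_start, service_restored, rto, rpo):
    """Compare one outage against its recovery objectives."""
    # How long the system was unavailable during the recovery.
    downtime = service_restored - outage_start
    # Work done since the last backup, which must be recaptured or done without.
    data_gap = outage_start - last_backup
    return {
        "downtime": downtime,
        "data_gap": data_gap,
        "rto_met": downtime <= rto,
        "rpo_met": data_gap <= rpo,
    }


# Hypothetical outage: nightly backup at 2:00, failure at 9:30, restored at 12:30.
report = recovery_report(
    last_backup=datetime(2024, 3, 4, 2, 0),
    outage_start=datetime(2024, 3, 4, 9, 30),
    service_restored=datetime(2024, 3, 4, 12, 30),
    rto=timedelta(hours=4),  # assumed objective
    rpo=timedelta(hours=8),  # assumed objective
)

for key, value in report.items():
    print(f"{key}: {value}")
```

In this made-up case the system was down for three hours and seven and a half hours of work must be recaptured, so both objectives are met. A less frequent backup schedule would fail the RPO check even if the restore itself were just as fast.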

In the small chart on page 62, the text shows us two curves plotted on the same graph. The Cost to Recover is highest if we have a constant live copy of the data, and lowest if we use an old-fashioned tape backup system. The old system looks good until we note that the time it takes to use it causes a much longer disruption time, which has associated costs that go higher the longer the disruption takes.

The simplified version of this chart, shown on the right, makes the same point, but may be easier to read. The projected cost to the organization in any scenario using these curves is the sum of the red and the blue lines at any given point.

Reasonable expectations about tolerable downtime and recovery time lead to a compromise that the text shows as the Cost Balance Point. If we spend more on our recovery system, we can expect the system to be down for less time, which lowers the costs that downtime creates. You need to plot those two curves for your own organization to determine what the best choice is.
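The trade-off between the two curves can be sketched numerically. Both cost functions below are invented for illustration, and treating the balance point as the recovery time with the lowest combined cost is an assumption of this sketch; a real analysis would use figures gathered from your own organization.

```python
def cost_to_recover(hours):
    """Hypothetical curve: a faster recovery (fewer hours) costs more to provide."""
    return 120_000 / hours


def cost_of_disruption(hours):
    """Hypothetical curve: losses grow the longer the system stays down."""
    return 2_500 * hours


def total_cost(hours):
    """Projected cost to the organization: the sum of the two curves."""
    return cost_to_recover(hours) + cost_of_disruption(hours)


def cost_balance_point(candidate_hours):
    """Return the candidate recovery time with the lowest combined cost."""
    return min(candidate_hours, key=total_cost)


# Consider recovery times from 1 hour to 72 hours, in whole hours.
best = cost_balance_point(range(1, 73))
print(f"Lowest combined cost at {best} hours: ${total_cost(best):,.0f}")
```

With these made-up curves, very short recovery times are dominated by the cost of the recovery system and very long ones by the cost of the disruption, so the lowest total falls in between.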

The text returns to the idea of determining the priority of each process to the business, and of each system or asset to the functions that use or depend on it. This is discussed again on page 63. At this point, you would think you would know a lot about the company's assets, functions, and priorities. The text turns to data collection on page 65. Eight data collection methods are listed, then discussed for the next twelve pages. Which one is best? The one that gets you to the truth. Keep that in mind. People often answer questions with either your agenda or their own in mind. Ask without pressure, and you may get better responses.

The last major topic in the chapter is budgeting. The text lists four operations that will require a budget. They are four familiar areas by now.

  • incident response - The text points out that this is a normal, expected IT operation. Not every incident is a widespread emergency, but they all need to be handled.
  • disaster recovery - The text proposes that the largest cost of recovery is insurance. The best recommendation we can make about it may be to consult industry associations to determine what are considered to be best practices.
  • business continuity - This one requires you to estimate and collect money for extra locations, employees, equipment, and data devices, to be used during emergencies.
  • crisis management - This item concerns large-scale disasters that bring huge physical losses and long-term psychological damage. It may also concern more predictable losses and expenses to employees, such as funeral costs. A lot depends on the benefit packages your employees may already have.

Assignments

The assignment for this week is to do the Business Impact Analysis assignment for Module 2.