ITS4350A - Incident Response and Disaster Recovery


Chapter 10, Disaster Recovery; Chapter 11, Business Continuity

Objectives:

This lesson is about chapter 10. Objectives important to this lesson:

  1. Key challenges
  2. Preparing for DR
  3. Recovery phase
  4. Resumption phase
  5. Restoration phase

Concepts:

This chapter begins with our text's illustrative company having a fire. No details are available, but a report is made by an employee who just arrived at the site and found out that a disaster has happened. I am reminded of a standard news teaser. In this case it might be "Breaking news! A fire has shut us down! Film at 11!"

Yes, there is a time and a place for acting like Paul Revere, but that time passes quickly. Soon, you need more information to do anything worth doing. Our hero reports to a higher authority who acts on unconfirmed information to activate two plans. I hope they are flexible, and that they start with gathering confirmed information.

The text moves on to discuss a different scenario. Sometimes the disaster that affects our organization is a bigger disaster that affects everyone near our location. When this happens, we have to take in the big picture. Our disaster plan may assume that standard services in our community (e.g. utilities, transportation services, communication services, sanitation services, standard vendor services) are all available outside our organization. This may not be so in the cases of power outages, huge storms, or worse disasters that affect large areas. The text references lessons learned since 2020 from the Covid outbreak. We should all remember what we learned, and how our assumptions had to change.

The list of natural disasters in the text is not exhaustive. For example, if you have followed the weather news for the last few years, you may have noticed that bad storms are more frequent than they have been in the past. Hurricanes and floods can happen in almost any coastal area. Snow storms, ice storms, and wind storms (or all three at once) can happen quickly and dangerously. An emergency consists of a problem that you are not ready for. Our goal must be to plan ahead, and to be ready for whatever we can imagine, and to adapt to what we did not imagine.

The text has talked about making plans in several chapters, so we will assume that subject matter experts have been consulted and plans have been created. The text moves on to discuss three distributions for our plans:

  • office/work location - A paper copy of each plan, plus electronic copies on critical systems (computers or networks). Keeping copies on portable devices would also protect accessibility.
  • out of the office - Responsible staff should have copies, paper and electronic, at their homes.
  • online - Depending on the nature of the disaster, electronic files may be accessible on remote systems, which may be web sites that are not hosted at the work location, or storage sites hosted by outside providers.

The text continues with a list of trigger events that can lead to implementation of a plan:

  • management decision - Management may notify all staff or key staff that an event has occurred (or will occur soon), and that we are beginning to run a named/numbered plan.
  • employee notification - An employee may notify management that an event is in progress, which will cause a plan to be put into use. Management decisions are typically required, but the employee notification is the trigger, and the employee may have to begin the plan implementation if the usual authority is not available.
  • emergency management - A state or federal agency may declare that an emergency exists, which may trigger a related plan for our organization.
  • local emergency - As in the example in the text, a fire or other local disaster may affect our organization, causing us to use an emergency plan.
  • media (news) outlet - A responsible news entity may announce that an emergency, a disaster, or an act of terrorism has occurred. If the event involves our organization, this should also trigger the use of an emergency plan.

As part of the discussion of what to plan for, the author gives us a list of teams that might be needed during the disaster. The list gives us something to think about, regarding what they will do for us and how they will make the situation better.

  • Disaster management team
  • Communications team
  • Computer hardware recovery team
  • Systems recovery team
  • Network recovery team
  • Storage recovery team
  • Applications recovery team
  • Data management team
  • Vendor contact team
  • Damage assessment and salvage team
  • Business interface team
  • Logistics team

The only one that seems to need an explanation is the Business Interface team. They are the interface between the IT department and the rest of the organization. Some of what they do might be included in other teams, depending on how you set up the teams.

The chapter concludes with a series of phases that follow the trigger event.

  • Response phase - The people dealing with the disaster contain it and protect resources according to the plan's hierarchy, which will probably match the one explained above.
  • Recovery phase - The things that keep us in business are recovered first, as addressed in your business continuity plan, or your disaster recovery plan if there is no BCP.
  • Resumption phase - Having dealt with the BCP, the other items in the business impact plan are addressed by recovering them in an order dictated by their dependencies.
  • Restoration phase - This phase has more to do with restoration of the business location. Restore, rebuild, or relocate? It depends greatly on what went wrong and how badly the original site was damaged.
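
The resumption phase idea of recovering items "in an order dictated by their dependencies" can be sketched as a topological sort. The following is a minimal Python illustration, not something taken from the text. The service names and their dependencies are invented for the example, and a real list would come from your own business impact work.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical services (assumption, not from the text).
# Each service maps to the set of services it depends on.
dependencies = {
    "network": set(),
    "storage": {"network"},
    "database": {"storage", "network"},
    "payroll_app": {"database"},
    "email": {"network"},
}

def recovery_order(deps):
    """Return a list in which every service appears after the services it depends on."""
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(deps).static_order())

order = recovery_order(dependencies)
print(order)
```

Several orders can be valid for the same set of dependencies. The only guarantee is that nothing is recovered before the things it needs. A circular dependency raises an error, which is worth finding during planning rather than during a disaster.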


Chapter 11, Business Continuity Planning

Objectives:

This lesson is about chapter 11. Objectives important to this lesson:

  1. Key elements
  2. Who to include
  3. Construction of the plan
  4. Elements for effective plans
  5. Activating a BC plan
  6. Maintenance and improvement

Concepts:

The chapter begins by taking us back in time, nine months before the fire that started at the beginning of the last chapter. All is well, and the business continuity team is inspecting a possible operations site for emergency use.

The text compares this chapter to chapter 9, which was about planning for disaster recovery. It notes that the business continuity plan has common features, but the two plans have different goals. Disaster recovery is about moving from a temporary site back to our original or new operations site. Business continuity is about resumption of interrupted operations at the temporary site. They are both concerned with continuing operations, but the disaster recovery plan assumes that operations are currently running.

The text spends its usual number of pages recommending who should be on the planning committee, reminding us to include decision makers as well as people who know how the work is actually done. This includes technical work (e.g. hardware, software, maintenance) and the actual work of the organization. The text continues with its templates about creating the plan and testing it, reminding us to include controls to avoid our needing to activate the plan. People charged with creating such a plan may forget that the best approach to dealing with a disaster is to prevent its occurrence. I am reminded of a British car company's commercial from several years ago (no video available, unfortunately) that discussed car crash tests, and observed at the end that "in England, we endeavor to miss a wall". Let's follow that model, where we can.

The creation of a business continuity plan is complicated by the necessity of dealing with an outside entity, whether that entity is another office in the same organization or an external space/service provider. The text proposes a plan that starts with moving some people out of their usual location into another one that is still within the confines of our owned/managed space. It goes through six levels, concluding with moving everyone in the organization to "an external, distant location". Obviously, the logistical complications become more involved with each increase in the number of people to move and the distance to move them. Each level of complexity requires a separate plan.

Expanding on that idea, the text tells us to make sure we accurately measure the level of damage and the number of people affected by the disaster at hand. When a low complexity plan is engaged, we need to be sure that we have not made a mistake, moving an insufficient number of people an insufficient distance. Likewise, we don't want to overreact to the disaster, moving staff who do not actually need to move. The move will disrupt business if it is not needed, which is the opposite of our intent.
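
The idea of choosing a plan that moves enough people far enough, without overreacting, can be sketched in a few lines of Python. This is only an illustration under stated assumptions. These notes describe just the first and last of the six levels, so the staff counts and distances below are invented placeholders, and the names RelocationPlan and choose_plan are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelocationPlan:
    level: int              # 1 = least disruptive, 6 = most disruptive
    max_staff: int          # most people this plan can move
    max_distance_km: float  # how far from the usual location it moves them

# Invented numbers for illustration only; they do not come from the text.
PLANS = [
    RelocationPlan(1, 25, 0.1),
    RelocationPlan(2, 100, 0.1),
    RelocationPlan(3, 25, 5.0),
    RelocationPlan(4, 100, 5.0),
    RelocationPlan(5, 500, 50.0),
    RelocationPlan(6, 5000, 500.0),
]

def choose_plan(affected_staff, required_distance_km, plans=PLANS):
    """Pick the lowest level that moves enough people far enough."""
    for plan in sorted(plans, key=lambda p: p.level):
        enough_people = plan.max_staff >= affected_staff
        far_enough = plan.max_distance_km >= required_distance_km
        if enough_people and far_enough:
            return plan
    raise ValueError("no plan covers this incident")

print(choose_plan(10, 0.05).level)  # small, local incident
print(choose_plan(200, 1.0).level)  # more people than the lower levels can move
```

Checking the levels from lowest to highest guards against both mistakes the text warns about: a plan is skipped if it moves too few people or moves them too short a distance, and the search stops at the first plan that is sufficient, so staff are not moved farther than needed.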

The state of Michigan, for example, often encounters incidents that cause staff to be moved, temporarily, from one location to another. This is often due to a power outage at their usual location. When this occurs, it is necessary to take as many of the measures that the text lays out as the incident calls for. Notify staff, notify the media, notify customers, and continue notification as the situation changes. The state is a large enough organization that staff of most departments can move for a short time to space in other state buildings, usually buildings occupied by the same department. When this is not possible, higher level plans are used. The plan for Covid grew to include many workers at remote locations all week long, and now includes work both at home and at "permanent" locations.

The text presents a number of considerations that will be encountered by any organization moving a significant percentage of its operations. Space, equipment, and services cover most of the worries. On several pages in the chapter, the text discusses the actions of "the Advance Party". In one respect, the committee that secures the alternate site is the advance party. They make the first inspection, long before an incident occurs, and they arrange for expected services, equipment, and space. In another respect, that inspection and arrangement only sets up an expectation. The reality is what the first people we send to an alternate location find waiting for them. If we know that the location will have equipment that needs to be set up (warm site?), that first group needs to include the technicians who can do that job. The second wave does not need to arrive until some of that setup activity has been done.

As usual, the author stresses testing the plan as a read-through, as a table-top exercise, as a walk-through, and as a live exercise.

A thought to take from recent history is that people are always in motion. It is natural to be going somewhere else. What's hard is to move everyone somewhere at once. Maybe we would do better to embrace the idea of working remotely more often, so that we can continue our operations regardless of where the staff happen to be.

Whatever your plan turns out to be, be flexible and make it work. Improve the plan as you go, if you can, and improve it for next time if you can't.