Chapter 10, Disaster Recovery; Chapter 11, Business Continuity
Objectives:
This lesson is about chapter 10. Objectives important to this
lesson:
Key challenges
Preparing for DR
Recovery phase
Resumption phase
Restoration phase
Concepts:
This chapter begins with our
text's illustrative company having a fire. No details are
available, but a report is made by an employee who just arrived at
the site and found out that a disaster has happened. I am reminded
of a standard news teaser. In this case it might be "Breaking
news! A fire has shut us down! Film at 11!"
Yes, there is a time and a place for acting like Paul Revere, but
that time passes quickly. Soon, you need more information to do
anything worth doing. Our hero reports to a higher authority who
acts on unconfirmed information to activate two plans. I hope they
are flexible, and that they start with gathering confirmed
information.
The text moves on to discuss a different scenario. Sometimes the
disaster that affects our organization is a bigger disaster that
affects everyone near our location. When this happens, we have to
take in the big picture. Our disaster plan may assume that standard
services in our community (e.g. utilities, transportation services,
communication services, sanitation services, standard vendor
services) are all available outside our organization. This may not
be so in the cases of power outages, huge storms, or worse disasters
that affect large areas. The text references lessons learned since
2020 from the Covid outbreak. We should all remember what we
learned, and how our assumptions had to change.
The list of natural disasters in the text is not exhaustive. For
example, if you have followed the weather news for the last few
years, you may have noticed that bad storms are more frequent than
they have been in the past. Hurricanes and floods can happen in
almost any coastal area. Snow storms, ice storms, and wind storms
(or all three at once) can happen quickly and dangerously. An emergency
is a problem that you are not ready for. Our goal must be to plan
ahead, to be ready for whatever we can imagine, and to adapt to what
we did not imagine.
The text has talked about making plans in several chapters, so we
will assume that subject matter experts have been consulted and
plans have been created. The text moves on to discuss three
places to distribute copies of our plans:
office/work location - A paper copy of each plan, plus
electronic copies on critical systems (computers or networks).
Keeping copies on portable devices would also protect
accessibility.
out of the office - Responsible staff should have copies,
paper and electronic, at their homes.
online - Depending on the nature of the disaster, electronic
files may be accessible on remote systems, which may be web
sites that are not hosted at the work location, or storage sites
hosted by outside providers.
The text continues with a list of trigger events that can lead to
implementation of a plan:
management decision - Management may notify all staff or key
staff that an event has occurred (or will occur soon), and that
we are beginning to run a named/numbered plan.
employee notification - An employee may notify management that
an event is in progress, which will cause a plan to be put into
use. Management decisions are typically required, but the
employee notification is the trigger, and the employee may have
to begin the plan implementation if the usual authority is not
available.
emergency management - A state or federal agency may declare
that an emergency exists, which may trigger a related plan for
our organization.
local emergency - As in the example in the text, a fire or
other local disaster may affect our organization, causing us to
use an emergency plan.
media (news) outlet - A responsible news entity may announce
that an emergency, a disaster, or an act of terrorism has
occurred. If the event involves our organization, this should
also trigger the use of an emergency plan.
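The employee notification case, where the person who reports the event may have to start the plan, can be illustrated with a small Python sketch. The names Trigger and who_activates are hypothetical. Applying the same fallback to every trigger other than a management decision is my assumption, not something the text states.

```python
from enum import Enum


class Trigger(Enum):
    """The five trigger events listed above."""
    MANAGEMENT_DECISION = "management decision"
    EMPLOYEE_NOTIFICATION = "employee notification"
    EMERGENCY_MANAGEMENT = "emergency management"
    LOCAL_EMERGENCY = "local emergency"
    MEDIA_OUTLET = "media (news) outlet"


def who_activates(trigger: Trigger, authority_available: bool) -> str:
    """Return who starts the plan (hypothetical policy, for illustration)."""
    if trigger is Trigger.MANAGEMENT_DECISION:
        # Management made the call, so management runs the plan.
        return "management"
    if authority_available:
        # Assumption: the usual authority decides whenever it can be reached.
        return "management"
    # Nobody with authority can be reached: the person who noticed begins.
    return "reporting employee"


print(who_activates(Trigger.EMPLOYEE_NOTIFICATION, authority_available=False))
```

A real plan would also record what has been confirmed and what has not, since the first report is often incomplete.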
As part of the discussion of what to plan for, the author gives
us a list of teams that might be needed during the disaster. The
list gives us something to think about, regarding what they will
do for us and how they will make the situation better.
Disaster management team
Communications team
Computer hardware recovery team
Systems recovery team
Network recovery team
Storage recovery team
Applications recovery team
Data management team
Vendor contact team
Damage assessment and salvage team
Business interface team
Logistics team
The only one that seems to need an explanation is the Business
Interface team. They are the interface between the IT department
and the rest of the organization. Some of what they do might be
included in other teams, depending on how you set up the teams.
The chapter concludes with a series of phases that follow the
trigger event.
Response phase - The people dealing with the disaster contain
it and protect resources according to the plan's hierarchy,
which will probably match the one explained above.
Recovery phase - The things that keep us in business are
recovered first, as addressed in your business continuity plan,
or your disaster recovery plan if there is no BCP.
Resumption phase - Having dealt with the BCP, the other items
in the business impact plan are addressed by recovering them in
an order dictated by their dependencies.
Restoration phase - This phase has more to do with restoration
of the business location. Restore, rebuild, or relocate? It
depends greatly on what went wrong and how badly the original
site was damaged.
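The idea in the resumption phase of recovering items "in an order dictated by their dependencies" is what programmers call a topological sort. Here is a minimal Python sketch using the standard library's graphlib module (Python 3.9 or later). The system names and their dependencies are made up for illustration and do not come from the text.

```python
from graphlib import TopologicalSorter

# Hypothetical systems: each one maps to the set of systems it depends on.
dependencies = {
    "network": set(),
    "storage": {"network"},
    "database": {"storage", "network"},
    "email": {"network"},
    "order_entry": {"database"},
}


def resumption_order(deps):
    """Return the systems in an order where every dependency comes first."""
    # static_order() raises graphlib.CycleError if two systems depend on
    # each other, which is a planning problem worth finding before a disaster.
    return list(TopologicalSorter(deps).static_order())


order = resumption_order(dependencies)
print(order)
```

Whatever order comes out, the network is recovered before storage, and storage before the database. Systems that do not depend on each other can be worked on by different teams at the same time.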
Chapter 11, Business Continuity Planning
Objectives:
This lesson is about chapter 11. Objectives important to this
lesson:
Key elements
Who to include
Construction of the plan
Elements for effective plans
Activating a BC plan
Maintenance and improvement
Concepts:
The chapter begins by
taking us back in time, nine months before the fire that started
at the beginning of the last chapter. All is well, and the
business continuity team is inspecting a possible operations site
for emergency use.
The text compares this chapter to chapter 9, which was about
planning for disaster recovery. It notes that the business
continuity plan has common features, but the two plans have
different goals. Disaster recovery is about
moving from a temporary site back to our original
or new operations site. Business continuity
is about resumption of interrupted operations at
the temporary site. They are both concerned with continuing
operations, but the disaster recovery plan assumes that operations
are currently running.
The text spends its usual number of pages recommending who should
be on the planning committee, reminding us to include decision
makers as well as people who know how the work is actually done.
This includes technical
work (e.g. hardware, software, maintenance) and the actual
work of the organization. The text continues with its templates
about creating the plan and testing it, reminding us to include
controls to avoid our needing to activate the plan. People charged
with creating such a plan may forget that the best approach to
dealing with a disaster is to prevent its occurrence. I am
reminded of a British car company's commercial from several years
ago (no video available, unfortunately) that discussed car crash
tests, and observed at the end that "in England, we endeavor to miss a wall". Let's follow
that model, where we can.
The creation of a business continuity plan is complicated by the
necessity of dealing with an outside
entity, whether that entity is another office in the same
organization or an external space/service provider. The text
proposes a plan that starts with moving some people out of their
usual location into another one that is still within the confines
of our owned/managed space. It goes through six levels, concluding
with moving everyone in the organization to "an external, distant
location". Obviously, the logistical complications become more
involved with each increase in the number
of people to move and the distance
to move them. Each level of complexity requires a separate plan.
Expanding on that idea, the text tells us to make sure we
accurately measure the level of damage and the number of people
affected by the disaster at hand. When a low complexity plan is
engaged, we need to be sure that we have not made a mistake,
moving an insufficient number of people an insufficient distance.
Likewise, we don't want to overreact to the disaster, moving staff
who do not actually need to move. The move will disrupt business
if it is not needed, which is the opposite of our intent.
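To make the "not too little, not too much" idea concrete, here is a minimal Python sketch that picks the lowest level plan that still covers the incident. The text describes six levels. The three tiers below, their capacities, and the names RelocationPlan and choose_plan are placeholders I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelocationPlan:
    level: int               # higher level means a more complex move
    max_people: int          # most staff this plan can relocate
    max_distance_km: float   # farthest move this plan covers


# Hypothetical tiers and figures, not taken from the text.
PLANS = [
    RelocationPlan(level=1, max_people=20, max_distance_km=0.5),
    RelocationPlan(level=2, max_people=100, max_distance_km=5),
    RelocationPlan(level=3, max_people=1000, max_distance_km=500),
]


def choose_plan(people_affected, distance_needed_km, plans=PLANS):
    """Return the lowest level plan that covers both people and distance."""
    for plan in sorted(plans, key=lambda p: p.level):
        covers_people = plan.max_people >= people_affected
        covers_distance = plan.max_distance_km >= distance_needed_km
        if covers_people and covers_distance:
            return plan
    # Nothing fits: the incident is bigger than anything we planned for.
    raise ValueError("no plan covers this incident")


print(choose_plan(people_affected=10, distance_needed_km=3).level)
```

Checking the plans from lowest level to highest avoids the overreaction the text warns about, and the coverage test avoids underreacting. The result is only as good as the damage and headcount estimates that go in.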
The state of Michigan, for example, often encounters incidents
that cause staff to be moved, temporarily, from one location to
another. This is often due to a power outage at their usual
location. When this occurs, it is necessary to take as many of the
measures that the text lays out as the incident calls for. Notify
staff, notify the media, notify customers, and continue
notification as the situation changes. The state is a large enough
organization that staff of most departments can move for a short
time to space in other state buildings, usually buildings occupied
by the same department. When this is not possible, higher level
plans are used. The plan for Covid grew to include many workers at
remote locations all week long, and now includes work both at home
and at "permanent" locations.
The text presents a number of considerations that will be
encountered by any organization moving a significant percentage of
its operations. Space, equipment, and services cover most of the
worries. On several pages in the chapter, the text discusses the
actions of "the Advance Party".
In one respect, the committee
that secures that alternate site is the advance party. They make
the first inspection, long before an incident occurs, and they
arrange for expected services, equipment, and space. In another
respect, that inspection and arrangement only sets up an expectation.
The reality is what the
first people we send to an alternate location find waiting for
them. If we know that the location will have equipment that needs
to be set up (warm site?), that first group needs to include the
technicians who can do that job. The second wave does not need to
arrive until some of that setup work has been done.
As usual, the author stresses testing the plan as a read-through,
as a table-top exercise, as a walk-through, and as a live
exercise.
A thought to take from recent history is that people are always
in motion. It is natural to be going somewhere else. What's hard
is to move everyone
somewhere at once. Maybe we would do better to embrace the idea of
working remotely more often, so that we can continue our
operations regardless of where the staff happen to be.
Whatever your plan turns out to be, be flexible
and make it work. Improve
the plan as you go, if you can, and improve it for next time if
you can't.