CSS 211 - Introduction to Network Security
Lesson 8: Chapter 13, Business Continuity
Objectives:
This lesson covers chapter 13 in the text. It discusses
business continuity planning and activities. Objectives important to
this lesson:
- Environmental controls
- Redundancy planning
- Disaster recovery procedures
- Incident response procedures
Concepts:
Business Continuity
Chapter 13 begins with some remarks that
relate to business continuity, which means that we keep running the business
even though we have had a problem, a setback,
or a disaster. We should consider the author's remarks
about the effects of a business disruption that might be reduced by detailed
planning and action based on such planning.
Some definitions from the text may be helpful in understanding the point
of the chapter.
- business continuity - continuation
of operations and services despite a disruptive event
- continuity of operation - same as
business continuity
- risk assessment - analyzing risks,
their effects, and what we can do to reduce
the probability of their occurrence and the severity of their effects
- planning and testing - identification
of risks and threats, creating plans to deal with them,
and conducting tests of those plans
- business impact analysis - identifying
mission critical business functions to prioritize our
continuity plans
- IT contingency plan - a plan to continue
to provide services in case a particular
incident disrupts normal service; a separate plan will
be needed for each type of incident that can occur
- disaster recovery plan - restoring
services provided by the enterprise to their standard state; this is
not just about IT services
The text discusses document format and content which will vary greatly
from one enterprise to another. It also discusses disaster exercises,
which will vary as well, but will be more similar across organizations.
In different organizations, several different kinds of plans are made,
called by different names that relate to the circumstances of the event
and the scope of the plan.
- Business Impact
Analysis - The green
highlight on this bullet is to show that this step should be done when
times are good and we can examine our systems performing normally.
Before you can plan for what to do, you have to figure out what is normal
for your business, what can go wrong, and what can be done to minimize
the impact of incidents and problems/disasters (see the bullets below).
- What are the business's critical
functions? Can we construct a prioritized
list of them?
- What are the resources (IT
and other types as well) that support those functions?
- What would be the effect of a successful
attack on each resource?
- What controls should be
put in place to minimize the effects of an incident or disaster? (Controls
are proactive measures to prevent or minimize threat exposure.)
- Incident Response Planning -
The red highlight on this bullet
is to acknowledge that the plans made in this step are used when there
is an emergency for one or more users. (Shields up, red alert? Why were
the shields down?)
The text is consistent with the ITIL
guidelines that call a single occurrence of a negative event an incident.
An incident response plan is a procedure that would
be followed when a single instance is called in, found, or detected.
For example, a user calls a help desk to report a failure of a monitor
that is under warranty. (Note that this is an example of an IT incident,
not an IT security
incident. What further details might make this part of a security incident?)
There should be a common plan to follow that will repair or replace
the monitor. Incident Response Plans (Procedures) may be used on a daily
basis.
- Business
Continuity Planning - The orange
highlight is meant to indicate that these plans are not concerned with
"fighting the fire", but with conducting business while the fire is
being put out.
Business continuity means keeping the
business running,
typically while the effects of a disaster are still
being felt. If we have no power, we run generators. If we cannot run
generators (or our generators fail), we go where there is power and
we set up an alternate business site. Or, if the scope of the event
is small (one or two users out of many) maybe we pursue incident management
for those users and business continuity is not a problem.
- Disaster Recovery Planning -
The yellow highlight here is
to indicate that the crisis should be over and we are cleaning up the
crime scene with these plans.
Determining that a disaster has even occurred requires that we judge
its scope. One person having a desk full of paper ruined by spilled
water is not a disaster. (For perspective, consider the legend about
Isaac Newton,
who reportedly handled a worse circumstance with more grace.) A disaster
requires widespread effects that must be overcome. A disaster might
be most easily understood if you think of a hurricane, consequent loss
of power, flooding that follows, and the rotting of the workplace along
with the ruined computers and associated equipment. A disaster
plan is what we do to restore
the business to operational
status after the disaster is over. There may be specific
plans to follow for disasters under the two bullets above, but the disaster
recovery plan is used after the crisis, unless
this term is applied differently in your working environment. Multiple
incidents can become a disaster, or may lead us to realize that there
is one, especially if there is no plan to overcome them.
- By the way, in ITIL terms, a series
of incidents may lead us to discover what ITIL calls a problem,
something that is inherently wrong in a system that might affect all
of its users. Some books call this a disaster.
The organization you work for may use all three terms, or any two of
them to mean different scopes
of trouble. You need to know the vocabulary to use in the setting where
you work, and you need to call events by the names they use.
- Is there a condition for a blue highlight? We
might pretend there can be, but it is unlikely that the IT Security
staff would ever feel that safe and serene.
The text discusses redundancy and fault tolerance.
Normally, we consider redundancy something that we should reduce
in a computer system. For the purpose of business continuity, redundancy
has virtues. If we only have one of
anything that is critical to our business, we will have
a hard time continuing to operate without it. This is what the text means
by a system having a single point of failure. It is not
the only place we can have a failure, but it puts us out of business if
it fails.
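To put a number on the value of redundancy, here is a minimal sketch. It is not from the text. It assumes that the redundant copies fail independently of each other, which is optimistic if they share power, a rack, or a building. The function name combined_availability is made up for this illustration.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of a service backed by `copies` redundant components.

    Assumption (not from the text): failures are independent, so the
    service is down only when every copy is down at the same moment.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    chance_all_down = (1.0 - single) ** copies
    return 1.0 - chance_all_down


if __name__ == "__main__":
    # One copy is a single point of failure; more copies shrink the
    # chance that everything is down at once.
    for copies in (1, 2, 3):
        print(copies, f"{combined_availability(0.99, copies):.6f}")
```

Under that independence assumption, two copies of a component that is up 99% of the time give a service that is up roughly 99.99% of the time. Real systems rarely do that well, because the copies usually have something in common that can fail.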
The text presents a table of availability levels, each expressed as the
percentage of a year that a system is up (Table 13-2), along with the equivalent
downtime if the system were off line that often in a week
or a month. This is interesting, but you should realize
that system outages measured as annual amounts may be clustered
around only a few events, not spread evenly throughout
each month, week, or day of the year. You should be aware of the notation
regarding the number of nines in each example, since it is commonly used:
Availability (%) | Name | Yearly Downtime
90 | One Nine | 36.5 days
99 | Two Nines | 3.65 days
99.9 | Three Nines | 8.76 hours
99.99 | Four Nines | 52.56 minutes
99.999 | Five Nines | 5.26 minutes
99.9999 | Six Nines | 31.5 seconds
Note that we are not under an hour of downtime for the year until
we reach four nines, which is more reliability than the average customer
would think to ask for until they see a chart like this. Do you think
an hour isn't much? What if it happens all at once? What if your cable
system was out for an hour a year? How about your phone? How about the
911 service in your area? How about the power to a hospital or to an air
traffic control system? For some systems, failure
is not an option.
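The yearly downtime figures in the table are simple arithmetic, and you can check them yourself. The sketch below is mine, not the author's. It assumes a 365-day year, which is what the table's numbers appear to use, and the function names are made up for this illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def yearly_downtime_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year allowed at a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (1.0 - availability_percent / 100.0)


def describe(seconds: float) -> str:
    """Express a duration in the largest unit that keeps it at 1 or more."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for percent in (90, 99, 99.9, 99.99, 99.999, 99.9999):
        print(percent, describe(yearly_downtime_seconds(percent)))
```

Remember that these are yearly totals. Nothing in the arithmetic says whether the downtime arrives as one long outage or many short ones.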
To approach a system that never appears to fail, we must analyze
the system, determine its mean time to recovery (MTTR),
and find what we must do to push that number as close to zero as we can.
The obvious answer in many cases is to have redundant
components that provide the same service, so that a spare takes over while the failed part is repaired. The text discusses adding redundancy
in several areas.
- servers - servers can be installed in clusters,
in which multiple devices provide the same services in case one or more
go down
- asymmetric cluster - one server
is designated as the standby (replacement) for
another; the standby server does nothing unless the first server
fails
- symmetric cluster - each server
in the cluster provides services at all times; if one
fails, its services are provided by the remaining
servers
- storage - The text discusses RAID,
which has been defined several ways. Eventually, all hard drives fail,
and RAID allows a system to continue in most cases. One common meaning
is Redundant Array of Independent Drives. The word
"independent" seems unnecessary, and is in fact misleading. Hard drives
set up in a RAID array perform functions that relate to each other.
Several kinds of RAID exist to provide for redundant storage of data
or to provide for a means to recover lost data. The text discusses four
types. Follow the link below to a nice summary of RAID level features
not listed in these notes, as well as helpful animations to show how
they work. Note that RAID 0 does not provide fault tolerance,
the ability to survive a device failure. It only improves read-write
speed.
RAID levels and features:
- RAID 0: Disk striping
- writes to multiple disks, does not provide fault tolerance. Performance
is increased, because each successive block of data in a stream
is written to the next device in the array. Failure of one
device will affect all data, so striping
does not improve fault tolerance;
it in fact decreases it.
- RAID 1: Mirroring and Duplexing
- provides fault tolerance by writing the same
data to two drives. Two mirrored
drives use the same controller card. Two duplexed
drives each have their own controller card. Aside from that difference,
mirroring and duplexing are the same: Two drives are set up so that
each is a copy of the other. If one fails, the other is available.
- RAID 5: Parity saved separately
from data - Provides fault tolerance by a different method. Data
is striped across several drives, but parity
data for each stripe is saved on a drive that does not hold data
for that stripe. If one drive fails, its contents can be rebuilt from
the remaining data and parity. In Windows, software RAID 5 has traditionally
been supported only by server operating systems, not by workstation editions.
- RAID 0+1: Striping and Mirroring
- uses a striped array like RAID
0, but mirrors the striped array onto another
array, kind of like RAID 1
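To see how the parity described for RAID 5 lets an array survive the loss of one drive, here is a minimal sketch using XOR, which is the usual way parity is computed. It models one stripe as a few small byte strings. The function name and the sample data are made up for this illustration, and a real controller does much more than this.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) walks the blocks one byte position at a time.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data drives, with parity on a fourth.
stripe = [b"ACCT", b"0042", b"PAID"]
parity = xor_blocks(stripe)

# Pretend the second drive has failed: XOR the survivors with the parity.
survivors = [stripe[0], stripe[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == stripe[1]
```

The same trick cannot recover two lost drives from one parity block, which is why a single-parity array tolerates only one failure at a time.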
- networks - The text mentions that some entities need
redundant connections to and through networks. It does not give specific
details about this concept.
- power - Power can be supplied to a computer system
through an Uninterruptible Power Supply (UPS)
that is essentially a smart battery that kicks in when the main power
is lost. The text describes two kinds of UPS:
- off-line (also called standby)
- keeps a charge on a battery which it uses to supply power in case
of a total loss
- on-line (also called inline)
- also has a battery, but it constantly provides
power from it, while continuously charging
it from the standard electrical power
The off-line (standby) model has a short
lag time in the event of a power loss before the battery
circuit starts working. The on-line (inline) model
does not have this lag time. A typical UPS works
with software that detects a power loss and alerts administrators
when it occurs. Depending on the capacity of the UPS and the load
placed on it, it may allow operation for hours, for minutes, or only
long enough to perform a shut down of the system it is protecting.
Backup generators are typical in large installations,
such as data centers that support a large population or enterprise.
- sites - In the case of a disaster that makes a work
site unusable, such as a fire or flood, it becomes necessary to have
a plan for alternate means of continuing business. The text lists three
types of off site operation plans, and a more recent addition regarding
"the cloud":
- cold site - a basic site with office space, but
without computers or other devices that you would have to supply,
without established connectivity, without a data copy unless you
can supply it
- warm site - has office space, hardware, and may
have connectivity; may have a recent backup of your data, but it
will have to be loaded on computers that may also have to be configured
- hot site - a functional duplicate of the site
that has gone down, including office space, computers, connectivity
to the Internet, telephone service, and the capacity to either load
a backup of your data that is stored there, or to use a copy of
your data that is already in place
- cloud site - use cloud storage or cloud computing
in conjunction with your site strategy; this may mean that you restore
from cloud storage, or that you use virtual cloud computing, or
both
The definitions of hot, warm, and cold sites vary between sources,
but the basic idea is always the same. The three types of sites provide
different levels of service and different time frames in which you
would be ready to resume business. Obviously, the hot
site is best but it requires the most money
and effort to maintain. The cold site is cheapest,
but it has additional costs that will be added as soon as you need
to use it.
Data Backups
Having introduced the idea of backups, the text discusses three common
methods used to create them. First some terms:
- Target - the device, volume, folder, or group of
files being backed up
- Archive bit - a bit in a file that is turned ON
when the file is changed; it is used to flag files
that have changed since the last backup: most backup
programs look for files whose archive bits are set to ON, copy those
files, then reset the archive bits (turn them OFF) on the target
files
- Full - a backup of all files in the
target; sets the archive bit of each file to OFF once
the backup is made
- Incremental - a backup of target files that
are new or changed since the last backup; depends
on the fact that programs that change files typically set the archive
bit to ON when a change is made; sets archive bit to
OFF for all files it copies
- Differential - a backup of all files new
or changed since the last Full backup; copies all files
whose archive bit is set to ON; does not
change the archive bit of files it copies because they will be copied
again in the next differential backup
- Copy - like a Full backup, but it does not
change the archive bits of files it copies. This is typically not part
of a standard backup strategy, but an option to work around the system.
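The archive bit rules above are easier to follow if you watch them run. This is a minimal sketch, not a real backup program: the files are just names in a dictionary, True means the archive bit is ON, and the function name run_backup is made up for this illustration.

```python
def run_backup(files: dict, kind: str) -> list:
    """Return the names a backup of this kind would copy.

    `files` maps a file name to its archive bit (True = ON) and is
    updated in place to show what the backup does to the bits.
    """
    if kind not in {"full", "incremental", "differential", "copy"}:
        raise ValueError(f"unknown backup kind: {kind}")
    if kind in ("full", "copy"):
        copied = sorted(files)  # everything in the target
    else:
        copied = sorted(name for name, bit in files.items() if bit)
    if kind in ("full", "incremental"):
        for name in copied:  # these two kinds turn the bit OFF
            files[name] = False
    return copied


files = {"a.txt": True, "b.txt": True, "c.txt": True}
first_full = run_backup(files, "full")           # copies all, clears all bits
files["a.txt"] = True                            # a.txt is edited
diff_one = run_backup(files, "differential")     # copies a.txt, bit stays ON
files["b.txt"] = True                            # b.txt is edited
diff_two = run_backup(files, "differential")     # copies a.txt and b.txt
incr_one = run_backup(files, "incremental")      # copies both, clears bits
incr_two = run_backup(files, "incremental")      # nothing left to copy
```

Notice that the second differential run copies a.txt again, while the second incremental run copies nothing. That difference is the whole story of backup time versus restore time.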
This needs more explanation. Assume we use a tape drive (more on other
options in a minute) to make backups. In a Full backup strategy,
the entire target is backed up to tape every time we make a backup tape.
This strategy consumes the most time and the most tapes
to carry out a backup. To restore, we simply restore the most recent
tape(s). This is the least time consuming strategy for restoring,
but the most time consuming for creating backups.
The second method, Incremental backup, means that we start with
a Full backup of the target, and then each successive backup tape
we create only backs up the elements that are new or changed since the
last backup of any kind was created. Successive backups will
not always be the same length, but each one holds only the changes made
since the previous tape, so this is the least time consuming
backup. It is also the most time consuming restore. To restore,
we must first restore the last Full backup made, and then restore
EVERY tape made since then, in order, to ensure getting all changes.
The third strategy, Differential backup, also starts with a Full
backup tape. Then each successive tape made will contain all the files
changed since the last Full backup was made. This means that we
will have to restore only one or two tapes in a restore operation.
If the last tape made was a Full tape, we restore only that one. If the
last tape made was a Differential tape, we restore the last Full tape,
then the last Differential tape.
The fourth strategy, Copy, is not mentioned in this
text, but it is no different from Full in terms of backup or restore time.
In both Incremental and Differential backup strategies, you will typically
use a rotation schedule. For example, you could
have a one week cycle. Once a week, you make a Full backup, then every
day after that you make the other kind you have chosen to use: Incremental
or Differential.
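The restore rules for the strategies described above can be written as a short procedure. This is a sketch under a simplifying assumption of mine: after the last Full backup, the rotation uses only one kind of backup, either Incremental or Differential, never a mix. The function name and the day labels are made up for this illustration.

```python
def tapes_to_restore(history):
    """Pick the tapes needed for a restore, oldest first.

    `history` is a list of (label, kind) pairs in the order the tapes
    were made, where kind is "full", "incremental", or "differential".
    """
    full_positions = [i for i, (_, kind) in enumerate(history) if kind == "full"]
    if not full_positions:
        raise ValueError("no Full backup in the history")
    start = full_positions[-1]
    needed = [history[start][0]]          # always begin with the last Full
    later = history[start + 1:]
    kinds = {kind for _, kind in later}
    if kinds <= {"incremental"}:
        needed += [label for label, _ in later]   # every tape since the Full
    elif kinds == {"differential"}:
        needed.append(later[-1][0])               # only the newest one
    else:
        raise ValueError("mixed or unknown backup kinds after the last Full")
    return needed


incremental_week = [("Sun", "full"), ("Mon", "incremental"),
                    ("Tue", "incremental"), ("Wed", "incremental")]
differential_week = [("Sun", "full"), ("Mon", "differential"),
                     ("Tue", "differential"), ("Wed", "differential")]
```

With these sample weeks, a Wednesday night failure needs four tapes under the Incremental rotation and two under the Differential rotation.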
To keep them straight in your mind, remember these facts:
Backup type | What does it back up? | What does it do to the archive bit?
Full | copies everything | Resets all archive bits in the target set.
Incremental | everything different from the last backup | Resets the archive bits of the target files it copies.
Differential | copies everything "different from Full" (Different from the last Full backup.) | Does not reset any archive bits.
Copy | makes a Full backup | Does not reset any archive bits.
The time required to create backups should be considered
along with the time to restore a backup. When you consider the
two concepts as two sides of the answer to a question (What method should
I use?), the answer may be the most common choice: Differential.
It is the best compromise in terms of backup time versus restore time.
Note also, that all standard methods require a full backup on a regular
cycle. The recommendation is usually to run a Full backup weekly.
The discussion above assumes that your backups are being written to tape,
which has been the most common method for many years. The text discusses
three other methods, each requiring different hardware.
- Disk to disk - Copying to other
drives is faster than copying to tape, but only if the drives are connected by a fast channel, such as being
in the same computer. That creates a problem: the copy sits in
the same location as the original, so one disaster can destroy both. Copying to a disk in another data
center is possible, and fast if they are connected by fiber, but costly
in terms of setup.
- Disk to disk to tape - Copying to
another disk, then backing up that disk to removable
storage reduces the time that your live server disk needs to be offline.
- Continuous data protection - Copies
all data to a backup device in real time, possibly
by using disk mirroring.
The text moves on with some observations about fire protection,
electrical shielding, and problems concerning Heating,
Ventilation, and Air Conditioning (HVAC).
Fire
The first threat considered is fire. Some statistics are given and the
author presents a list of four elements that must be present for a fire
to exist. His fourth element is the fire itself, so it does not belong
on the list. The other three are those we discussed in class earlier in
the term.
For a fire to exist, three factors are needed: heat, fuel, and oxygen.
If you can eliminate any one of these factors, the fire will go out.
This is why Carbon Dioxide extinguishers work: the CO2
replaces the oxygen in the immediate vicinity of a fire,
and the fire stops. Smothering a campfire works about the same
way.
A fire
break is an example of fighting a fire by depriving it of fuel.
Forest fires can be fought this way. Somewhat similarly, I once walked
into a rest room in an office and found that someone had placed a roll
of toilet paper on top of the light fixture over the sink. I noticed it
because it was on fire. I grabbed the roll of paper and tossed it into
the sink. This established a fire break between the fire and the rest
of the building. I then put out the fire on the roll of paper with water
(cooling it and cutting off its oxygen).
Keeping your computer system cool, so that a fire will not ignite,
is your most effective form of firefighting: don't let it start.
Fire Extinguishers - American fire extinguishers are classed by
the kind of fire they are able to put out. The links below will take you
to sites with more information about fire classes and extinguishers. In
surveying several sites, I found that there are currently at least four
classes of fires, and that the symbols for them have been updated to use
pictures instead of letters. Some sites list a Class K for cooking
oils (Kitchen fires), but this does not seem to be universal. The chart
below lists the American classes (the symbol images from the original chart are not reproduced here):
Description of Extinguisher Class
- Class A: paper, cloth, wood
- Class B: oil, gasoline, kerosene, propane
- Class C: electrical
- Class D: combustible metals, such as magnesium, potassium, sodium
- Class K: combustible cooking oils
The table below is from a Wikipedia
article on fire classes. It shows that the same kind of fire is called
by a different name in different places:
Comparison of fire classes
American | European | Australian/Asian | Fuel/Heat source
Class A | Class A | Class A | Ordinary combustibles
Class B | Class B | Class B | Flammable liquids
Class B | Class C | Class C | Flammable gases
Class C | UNCLASSIFIED | Class E | Electrical equipment
Class D | Class D | Class D | Combustible metals
Class K | Class F | Class F | Cooking oil or fat
In most cases, a multiclass extinguisher is preferred. On extinguishers
I examined at my workplace, multiple picture symbols were used, showing
the pictures for classes A, B, and C.
The text discusses some fire extinguishing systems. Common types are
sprinkler systems, foam systems, and gas dispersant systems.
- Sprinklers typically spray streams of water or water
mist. The test in the video behind this link seems to point out
a limitation of automatic mist.
- Gas dispersant systems used to use Halon,
and still can, but they are restricted to existing Halon supplies. Carbon
dioxide is an alternative, but both solutions tend to be dangerous to
air-breathing life forms in the immediate area.
- Another system uses foam as a suppressant, and the
people testing
this system seem to be enjoying it greatly.
- The text also mentions dry chemical systems which
spray a fine powder over the fire. This is similar
to using baking soda to fight a small kitchen fire.
Electromagnetic Shielding
The text discusses the fact that all kinds of electrical equipment radiate
electromagnetic signals to one degree or another. In this section it is important to
know a few facts.
- Faraday cage - a metal enclosure that blocks an electromagnetic
field from crossing it; a metal PC case acts as a Faraday cage that keeps
emissions from leaving the immediate area; the Faraday cage is named for
Michael Faraday
- TEMPEST - a set of standards developed by the NSA to reduce and
shield emissions, with the purpose of reducing the risk of eavesdropping;
TEMPEST may not actually be an acronym
HVAC
The text discusses some ideas about heating, ventilation, and air conditioning,
which include some concerns about humidity. Why do we care about humidity?
Higher humidity (50% or higher) inhibits ESD (Electrostatic
Discharge).
Static electricity - ESD, or Electrostatic
Discharge, can be a serious cause of problems. Some numbers from
a previous text may help you understand the situation:
- A human can't feel a static discharge unless it is
3,000 volts or more.
- Normal motion, like moving a chair or a foot can
generate 1,000 volts.
- Simply walking across a carpeted area can generate
1,500 to 35,000 volts.
- Handling a plastic envelope can generate 600
to 7,000 volts.
- Picking up a plastic bag can generate 1,200
to 20,000 volts.
- Damage can be done to computer parts with 20
to 30 volts.
The damage from low voltage may not cause immediate failure so you may
never know the cause of the failure that eventually happens.
Incident Response Procedures
An incident can be an event of any sort, but some texts, ours included,
use the word to mean an event caused by
an attack. The remainder of the chapter concerns the
actions that should be taken when an incident has been detected.
Forensics
A forensic investigation is typically one that concerns
a crime. This section is about computer forensics, investigations
into crimes that involve computers and other information system equipment.
The text discusses four aspects of an investigation:
- secure the scene - The team mentioned in the text
may be called an Incident Response Team or a Forensics Response Team,
or another title that means the same thing. They are responsible for
taking possession of devices that might hold any data that might contain
evidence of the crime being investigated. In addition, they should photograph
the scene, document their observations, and record interviews with witnesses.
- preserve and collect the evidence - This aspect is
closely related to the first, in that the response team may have to
take images of data in RAM that would be lost if not recorded before
the power is turned off. Note the Order of Volatility
(Table 13-9), which indicates the order in which to capture data from
a running system:
- Register, cache, and peripheral memory first
- Random Access Memory (RAM) second
- Network state third
- Running processes fourth
- establish (and maintain) the
chain of custody - There must be a continuous documentation
of who has had access to seized devices and data, who has done what
with it, and who it is turned over to at each change in custody.
- examine for legal evidence - Although the other discussions
have used the word "evidence" several times, this one brings up the
point that not everything you find is actually legal evidence. Only
things that indicate or prove a crime was committed can be considered
as evidence that will be presented in court.
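A chain of custody, as described in the list above, is at heart a log with no gaps. The sketch below shows one way such a log might be modeled. It is my illustration, not a procedure from the text: the class and field names are made up, and the use of a SHA-256 hash to show that an image has not changed is a common practice that I am adding as an assumption.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CustodyEvent:
    timestamp: str
    released_by: str
    received_by: str
    action: str


@dataclass
class EvidenceItem:
    description: str
    sha256: str                      # fingerprint taken at collection time
    events: list = field(default_factory=list)

    def transfer(self, released_by: str, received_by: str, action: str) -> None:
        """Record a change of custody; refuse it if the chain would break."""
        if self.events and self.events[-1].received_by != released_by:
            raise ValueError("custody gap: releaser is not the current holder")
        now = datetime.now(timezone.utc).isoformat()
        self.events.append(CustodyEvent(now, released_by, received_by, action))

    def verify(self, data: bytes) -> bool:
        """True if the data still matches the fingerprint taken at collection."""
        return hashlib.sha256(data).hexdigest() == self.sha256


def seize(description: str, data: bytes, collected_by: str) -> EvidenceItem:
    """Start the log at the moment the evidence is collected."""
    item = EvidenceItem(description, hashlib.sha256(data).hexdigest())
    item.transfer("scene", collected_by, "collected")
    return item
```

A short usage example: calling seize("USB drive image", image_bytes, "Analyst A") starts the log, and each later hand-off is a call to transfer. A hand-off from someone who is not the current holder raises an error, which is the software version of a gap in the paperwork.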
The text elaborates on memory and storage
locations that should be examined for meaningful data. What you can
expect to find there may surprise you:
- Windows page file - This is a hidden
file, typically on the boot drive, that Windows uses to store "memory
pages" that it thinks you are not using presently, like memory devoted
to an application that is minimized. The file is probably in the root
of the drive, and is probably called pagefile.sys.
You should expect to see pieces of anything that the computer was used
to work on, especially if it was minimized while the user worked on
something else.
- RAM slack - This will take a minute.
When Windows saves files, it saves to sectors (track
sectors) on a drive. Sectors are logically arranged in clusters,
which are the smallest storage area a file system can use. The number
of sectors in a cluster varies by the way a drive was formatted. When
a file is saved, it will take a certain number of clusters to hold it,
but the file itself may not actually fill the last sector used in the
last cluster used to store it. When this happens, older
versions of Windows (before NT) did something you may never have heard
about. They filled the last sector used for the file
with data pulled randomly from RAM.
This data is called RAM slack, a copy of a piece of
RAM that has been stored in the slack space at the
end of a sector. Why did it do this? Windows just worked that way: it
had to fill the rest of the sector. You never knew what you'd find in
it. Since NT, the RAM slack space has been filled with zeros, so this
is less of a problem as time goes by, except for stand-alone, legacy
systems that use older versions of Windows.
In the simplified illustration below a file has been saved to two four-sector
clusters, but only fills six and a half sectors. The cluster marked
in cyan is full: all
four sectors have been used by the file. The second
cluster is not full. The second half of the seventh sector (item F)
is filled with RAM slack. The eighth
sector has not been used at all, but we will cover that in a minute.
Note: for many years, a track sector has held a standard
512 bytes regardless of what device it was on. As of
January 2011, this was no longer true. A device using
Advanced Format on a system that understands it may
use sectors that hold 4096 (4K) bytes. This leads to
lots more room in a RAM slack situation. If a computer using such a
device is running an older OS, the 512 byte limit for sectors still
applies.
- Drive slack - So, if the cluster holds
a specific number of sectors, what if the file only used some
of those sectors when it was saved? Does Windows fill the rest of those
sectors, too? No, but something else interesting happens. If anything
was ever written to those sectors before, it remains
there undisturbed until there is a need to write to them. This means
that some sectors at the end of a cluster may hold
old data that the user thought was deleted. The data held in those sectors
is called Drive slack. You never know
what might be in it.
In the illustration below, the last sector of the second cluster (item
G) is Drive slack. The new file has not overwritten
whatever was in that sector already.
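The two kinds of slack described above come down to arithmetic on the file size. Here is a minimal sketch, with a made-up function name, that assumes the 512-byte sectors and four-sector clusters used in the example of a file that fills six and a half sectors.

```python
def slack_bytes(file_size: int, sector_size: int = 512,
                sectors_per_cluster: int = 4):
    """Return (ram_slack, drive_slack) in bytes for a file of this size.

    ram_slack   = unused bytes in the last sector the file touches
    drive_slack = bytes in the untouched sectors of the last cluster
    """
    if file_size <= 0 or sector_size <= 0 or sectors_per_cluster <= 0:
        raise ValueError("sizes must be positive")
    cluster_size = sector_size * sectors_per_cluster
    # Integer ceiling division: round up to whole sectors and clusters.
    sectors_used = -(-file_size // sector_size)
    clusters_used = -(-file_size // cluster_size)
    ram_slack = sectors_used * sector_size - file_size
    drive_slack = clusters_used * cluster_size - sectors_used * sector_size
    return ram_slack, drive_slack


# A file that fills six and a half 512-byte sectors is 3328 bytes long.
ram_slack, drive_slack = slack_bytes(3328)
```

For that file, the sketch gives 256 bytes of RAM slack (the unused half of the seventh sector) and 512 bytes of Drive slack (the untouched eighth sector). Try it with sector_size=4096 to see how much more room Advanced Format sectors leave.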
