Chapters 1 and 2

ITS 4350 - Disaster Recovery

Chapter 1, An Overview of Information Security and Risk Management

Objectives:

This chapter presents an overview of IT security concepts and how they relate to plans for handling incidents and disasters. Objectives important to this lesson:

Key concepts
Risk management
Contingency planning
Relating security policy to contingency planning

Concepts:

Chapter 1

Our text defines contingency planning as being the process that makes us prepared for incidents and disasters related to our organization's IT assets. We are given a few examples of historic incidents and a statistic that makes a good point. The author tells us that "80% of businesses affected by a disaster either never reopen or close within 18 months of the event". These are organizations that either had no disaster plans, or they had plans that were inadequate for the disasters they encountered.

Having made this point, the author spends several pages discussing terms that you will need to know. Many should be familiar to you.

Information Security - protection of information and the systems that collect, store, disperse, and use it

The next set of terms is one you see in a lot of texts about security:

Confidentiality - information should only be accessible to users who have been granted access to it for valid reasons. Only authorized users can access data if it is protected properly, and if authorized users do not violate security policy.
Integrity - data may not be changed except by authorized users or processes. This means that data must be protected from alteration, deletion, or other changes to its intended form.
Availability - authorized users can access data when they need to do so. Availability includes the idea that proper access methods are provided only to authorized users, not to everyone.

The text confuses the issue with a graphic on page 3 that the author does not explain. The classic CIA concept defines security from the point of view of the IT Security staff. The text should explain that an expansion of this concept goes by several names, one being the McCumber Cube, another being the CNSS Security model, which is the name used in the text. It provides three different perspectives on security, which should be considered together to make better security decisions:

IT Security perspective: Confidentiality, Integrity, Availability
How do we protect the information, make sure it is not tampered with, and provide access to those who need it?
IT Operations perspective: Storage, Processing, Transmission
How do we perform the basic IT functions of storing, processing, and transmitting data? Are our processes secure?
Business perspective: Policy, Education, Technology
How do we make the rules for employees about protecting information, educate our staff about protecting it, and use the technology we have to run our business safely?


The author continues with more terms:

Threat - a potential form of loss or damage; many threats are only potential threats, but we plan for them because they might happen
Threat agent - a vector for the threat, a way for the threat to occur; could be a person, an event, or a program running an attack
Vulnerability - a weak spot where an attack is more likely to succeed
Exploit - a method of attack
Control - A process that we put in place to reduce the impact and/or probability of a risk. The author mentions that a control can also be called a safeguard or a countermeasure.

On page 5, the text presents a list of threat types ranked from 1 to 12 in two different surveys. The rankings changed a bit, but the list is given an aura of accuracy by showing us the same categories in both survey years. Don't assume that these are the only threat categories that matter. The surveys were done by the same people, so they used the same categories. The author discusses each of the twelve categories to give you a feeling for what they are. Browse through any that are unfamiliar to you.

On page 13, the author changes topics, and discusses risk. As some of you know, we have classes just about risk management, so this section is not all there is to know about it. However, the graphic on page 13 gives us a nice overview of a workable process for managing risk.

In the first phase, we identify risks, by inventorying and classifying all our assets, and then identifying the threats that apply to those assets, and the vulnerabilities those threats could use against us.
The second phase takes us to the selection of appropriate controls, and justification of their cost and value to decision makers in our organization.

Risk management flow chart

The chapter continues with an expansion on each of the topics in the graphic.

Know something about the big picture - Who and what are we protecting, from whom and what. To know details about these subjects, do everything on the green chart.
Identify, classify, and assign values to assets - To know our exposure to risk, we have to know what we have and how it is exposed. Assets can be classified by their level of secrecy, their value to the organization, their need to be protected, or by combinations of these factors as well as others.

In the chart on page 28, we see five information assets, each on a separate line. Each is given a rating (from 0 to 1) on each of three measures of how a compromise of that asset would affect the company. The three measures in this case are impact on revenue, impact on profitability, and impact on image. Assuming these are the most important impacts our organization cares about, each is given a relative percentage score. In the example, the organization cares 30% about revenue, 40% about profitability, and 30% about image. That is the criterion weight. For each asset, its score for a given criterion is multiplied by that criterion's weight, producing three weighted criterion scores for each asset. The asset's total weighted score is the sum of its three weighted criterion scores. For instance, the first asset has a score of .8 for revenue (weighted criterion score is .8 times 30 = 24), .9 for profitability (weighted criterion score is .9 times 40 = 36), and .5 for image (weighted criterion score is .5 times 30 = 15), so its total weighted score for the comparison is 75. Compare that score to the other lines, and you see that this asset is the third most important asset in this comparison. Warning: do not compare scores from one table to another unless they use the same criteria and the same weights.
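The arithmetic for the first asset can be sketched in a few lines of Python. This is only an illustration of the weighted-score method described above; the dictionary and variable names are made up for the example, and the weights are kept as whole percentages to match the text.

```python
# Criterion weights from the example, as percentages (they sum to 100).
weights = {"revenue": 30, "profitability": 40, "image": 30}

# Ratings (0 to 1) for the first asset in the example.
first_asset = {"revenue": 0.8, "profitability": 0.9, "image": 0.5}

# Weighted criterion score = rating times that criterion's weight.
weighted = {name: first_asset[name] * weights[name] for name in weights}

# Total weighted score = sum of the weighted criterion scores.
total = round(sum(weighted.values()), 1)
print(weighted, total)
```

Running the same calculation on every row, with the same weights, gives the totals used to rank the assets against each other.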

The text provides another example of rating assets, based on a military scale that uses four levels for secrecy. A scale like this may be more useful for assets that do not have a particular effect on the organization unless they are compromised.
Threats must be identified, and matched with assets affected by them. Not all threats will affect all assets.
Assets must be examined again, with respect to the threats that could affect them. How vulnerable is each asset to each of its possible threats? This evaluation gets us ready to do the big one in a few pages.

Assuming you have followed the steps so far, there is an important calculation to do.

Each asset needs to be given a value, based on its replacement cost, its current value to the organization, or the value of the income it generates. Pick one of those or some other value you care about. This is the Asset Value. Let's choose $100 as an example for Asset Value.
Next, we need to determine, for each exploit, what the probable loss would be if that exploit occurs successfully. Would we lose the entire asset? Half of it? Some other percentage? Which percentage we pick tells us the Exposure Factor of a single occurrence of that exploit for this asset. Let's choose 50% as an example for Exposure Factor.
We are still not where we want to be. Asset Value times Exposure Factor equals the Single Loss Expectancy. This is the Impact if the event occurs. In this example, it is $50.
Now, we still need the Likelihood the event will occur. The classic way to do this is to consult your staff about the frequency of successful attacks of this type, or to consult figures from vendors like Symantec, McAfee, or Sophos about expected attack rates for your industry or environment. Let's assume we have done that, and we are confident that we expect 10 successful attacks per year in our example. This is the Annualized Rate of Occurrence.
Taking the numbers we have so far, we should multiply the Annualized Rate of Occurrence times the Single Loss Expectancy, which will give us the Annualized Loss Expectancy for this asset from this kind of attack. This corresponds to the Risk Exposure. In the example we are considering, that amounts to $500.
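The steps above can be sketched as a short Python calculation. The function names are invented for this illustration; the numbers are the ones chosen in the example (Asset Value of $100, Exposure Factor of 50%, and 10 successful attacks per year).

```python
def single_loss_expectancy(asset_value, exposure_factor):
    """SLE = Asset Value times Exposure Factor (the impact of one event)."""
    return asset_value * exposure_factor


def annualized_loss_expectancy(asset_value, exposure_factor, annual_rate):
    """ALE = SLE times Annualized Rate of Occurrence (the risk exposure)."""
    return single_loss_expectancy(asset_value, exposure_factor) * annual_rate


sle = single_loss_expectancy(100, 0.50)
ale = annualized_loss_expectancy(100, 0.50, 10)
print(f"SLE = ${sle:.2f}, ALE = ${ale:.2f}")
```

The same two functions would have to be called once for every combination of asset and attack type, which is where the real work lies.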

All that work led us to just one loss expectancy for one asset from one kind of attack. That gives you an idea of the work involved in calculating the numbers for each asset, each asset vulnerability, and each kind of attack on those vulnerabilities.

The next idea is to identify controls that can reduce or eliminate our risk. The text mentions five control strategies that are often considered. The terms are a little different from some other texts:

Defense - also called Avoidance, this means to use policies, training, and technology to avoid the situations that can be exploited.
Transferal - this means to hire expertise when you do not have it, or to pay a fee to another department or organization that is in the business of managing risk
Mitigation - this means to reduce the damage that will be done in a successful attack, such as not putting all assets of a given type in the same place, protected by the same defenses.
Acceptance - this is when you decide that a risk is not as costly to us as the controls that might be used to avoid or mitigate that risk.
Termination - this means that we decide to stop doing the things that put us at risk; we simply stop doing the things that use or produce the assets that a risk applies to.

So what do we do if we know that there are risks, and that we can't protect ourselves from all of them? The text introduces four topics from the next several chapters.

Business impact analysis - This process is used to determine the effect that successful attacks would have on our organization. We determine what could happen, what the effects of that event would be, and what state the organization's functions would be in at that time.

Incident response plan - For known incident types, given the BIA done in the section above, what should we do to handle the incident? Who do we call? How do we stop the attack and its effects? This plan is about handling the incident.

The text has the next two topics out of order. Let's fix that.

Business continuity plan - How do we continue business when we have an incident? Do we change our procedures? Do we use alternate locations or resources to continue business? How do we continue providing products or services when part of our organization has been damaged, compromised, disabled, or destroyed? Business continuity plans discuss keeping the business in business during the incident.

Disaster recovery plan - A disaster has occurred. How do we get back to normal, or what will be the new normal? The incident(s) has/have been handled. What do we do to return to our undamaged state, stronger and wiser than we were before?

All four of the major topics above are part of Contingency Planning, what we do when we know things can go wrong. The level of detail in each of the plans will be determined by the size and complexity of the organization making that plan. The text presents more plans identified by NIST that would apply to organizations like federal agencies.

Earlier in the chapter there are several pages on Information Security Policy. This section introduces the components you might find in a very detailed policy. It begins with some definitions:

A policy is a rule, or a set of rules, that affects how we want our organization and its employees to function. The idea behind a policy may start with a principle, which is often a broad, general statement of what we believe to be right, true, or beneficial. A policy is more detailed, and more specific about what we expect our people to do. Related concepts:

Principle - a general statement about what we believe or require in our area of authority (we will use only two computer vendors at a time); what we expect
Policy - rules about the conduct of our organization with regard to particular actions (we will limit ourselves to particular models chosen by the IT department); how we will approach the expectation
Standard - a method or process that may be procedural or technical (orders are to be placed by approved requesters within each work area); what steps are to be followed to assure general compliance with policy
Procedure - a detailed set of steps to follow to be in compliance (requests are to be made to your manager, who will forward approved requests to your authorized requester); variations or limitations that apply to specific work areas, to be followed if they apply to your area
Guideline - a suggested addition to any of the items above that is recommended but optional (submit your requests two weeks before the end of a quarter to allow processing time); do this to make it work better

On pages 13 through 18, we see a very detailed outline of the parts of a policy.

Statement of the policy - what it is, where it applies, and who has to do what
Authorized access - who is and is not allowed to use equipment or software related to the policy, and what is private about any related data
Prohibited use - what users are not allowed to do with the equipment or software related to the policy
System management - who runs it, who watches it, how it is to be protected, secured, and/or encrypted
Violations of policy - a graduated scale of offenses and discipline to be applied for violations of various types of the policy itself
Policy review and modification - how often the review will take place, who will do it, and the process to change or remove the policy
Limitations of liability - standard lawyer section

Chapter 2, Planning for Organizational Readiness

Objectives:

This chapter is the first chapter about contingency planning. Objectives important to this lesson:

Support of management
Forming a planning committee
Business impact analysis
Collecting data for a BIA
Budgeting contingency operations

Concepts:

Chapter 2

If you browse the scenario that opens chapter 2, you will see several people role playing a contingency. They are doing it in two parts. One is to run through a plan that is already in their operations manual. The other part is to involve the people in the room, to develop questions about the planned response, and to determine whether they are doing what should be done. In this case, they have a plan for a contingency, they are walking through the plan, and they are learning about the merits of the plan. They should not just be reading their assigned parts. They should be thinking about the reality of doing what they are assigned and proposing amendments to improve the plan.

The text talks about forming a body that will be responsible for creating contingency plans. That body has a number of duties, some of which should be done before it starts:

Obtaining commitment from senior management - There needs to be a commitment to empower the committee to even form, to do its job, to examine its output, to revise output as needed, to require all employees and departments to accept and follow the plans, and to allow for improvisation as needed; things never go exactly as planned.
Managing the contingency planning process - Assign members and other staff to gather information and assemble procedures.
Writing the master document - There needs to be a place to start, and once it starts, there may be area/division specific procedures that will be written by subject matter specialists.
Conducting the business impact analysis - Document threats, vulnerabilities, and attacks, and relate them to documented business functions.
Developing teams to create manuals for incident response, business continuity, disaster recovery, and crisis management.

The text offers suggestions about the kinds of knowledge that will be necessary to create the contingency plans. Note the section about representatives from other business units, which include the company's actual business, IT management, and IT security management. Remember that these are the three axes of the CNSS security model. The text tells us that there also needs to be commitment from management in these areas to create useful plans that will be available when they are needed.

In the scenario at the beginning of the chapter, we are left wondering whether the manuals being used by people from different work areas were identical, or different in some ways. There should be specific pages for some staff, depending mostly on the specialization or security level of their jobs. However, there should also be a master copy with all the sections in case someone needs to step in to do another person's job.

The text delivers a lot of details about a lot of details for several pages. Moving on to more focused material, let's continue on page 57 where we see five "Keys to BIA Success":

Set the scope to cover the necessary work areas of the organization for each risk to be addressed.
Get information from experts in your organization about the impact an exploit will have.
Keep the information factual where possible, to avoid opinions that may be mistaken.
Determine the areas you need to report on before gathering the data. This will save time, and be of more interest to approvers.
Get approval of your BIA and your risk assessment. Without it, your process will stop there.

The first of three phases of a business impact assessment begins on page 58 with making a list of the critical processes/functions of the organization you are studying. In the first column of the example chart on page 59, you see seven functions performed by a company. (The text notes that this chart of seven functions is an example, and that a real chart for a real company would be much longer.)

Function	Profitability:40%	Strategic Value:30%	Internal Ops:20%	Public Image:10%	Total Weighted Score (100%)
New business	8	8	3	6	6.8
Maintain old business	8	7	6	7	7.2
ISP service	10	8	4	8	8
Internet services	9	10	4	8	8.2
Help desk	5	6	6	8	5.8
Advertising services	6	9	4	9	6.8
Public relations	4	6	2	10	4.8

In the columns to the right of each function name, four kinds of impact are considered. Each kind of impact has been given a percentage rating, reflecting how much the company cares about it. Note that the sum of the percentages is 100%. If it were not, we would have to assume we are not measuring the impacts correctly, or we are measuring the wrong impacts.

Each function is rated on a scale, probably from 1 to 10, on how much its loss would affect each kind of impact. Note that the columns and the rows do not add up to a specific value. There is no presumption that they should. To get the weighted score for a function, its raw score for each column is multiplied by the percentage for that column, and those weighted scores are added together. The weighted score for each function is a measure of its criticality to the company's ongoing business.

In the chart above, the three functions with the highest weighted scores are Internet services (8.2), ISP service (8), and Maintain old business (7.2). They are the ones we need to protect the most. Preparing a chart of this sort, or a series of them, leads to our knowing which business functions should get our attention first and foremost.
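As a rough check on the chart, the whole table can be recomputed and ranked with a short Python sketch. The data are copied from the chart; the variable names are made up for this example, and the weights are held as whole percentages so the totals come out exactly as printed in the chart.

```python
# Column weights as percentages, in chart order:
# profitability, strategic value, internal ops, public image.
weights = (40, 30, 20, 10)

# Raw scores for each function, in the same column order.
functions = {
    "New business": (8, 8, 3, 6),
    "Maintain old business": (8, 7, 6, 7),
    "ISP service": (10, 8, 4, 8),
    "Internet services": (9, 10, 4, 8),
    "Help desk": (5, 6, 6, 8),
    "Advertising services": (6, 9, 4, 9),
    "Public relations": (4, 6, 2, 10),
}

# Weighted score = sum of (raw score times weight), scaled back from percent.
totals = {
    name: sum(score * weight for score, weight in zip(scores, weights)) / 100
    for name, scores in functions.items()
}

# Highest weighted score first; the top three deserve attention first.
ranked = sorted(totals, key=totals.get, reverse=True)
for name in ranked[:3]:
    print(name, totals[name])
```

Changing the weights and rerunning shows how sensitive the ranking is to what the organization says it cares about.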

Page 60 brings up a related idea. Some functions will have a different criticality when we consider which to restore from a damaged or disabled state first. Some functions rely on other functions to make them possible. In cases like this, it is necessary to prioritize the recovery of any service that a critical service depends on. With that in mind, it makes sense to consider the downtime metrics discussed in the text. (This is considered further in the table on page 60.)

Maximum Tolerable Downtime (MTD) - In the text, we see the example of a system that can only be down as much as 4 hours in a month. A more specific tolerance would state that the four hours need to be scheduled as one hour each week, on a day or shift during which the system is not needed. There is a difference between that standard, and four hours all at once, and four hours spread evenly across a month. This measure of time refers to normal operations.
Recovery Time Objective (RTO) - This time is similar to the measure above, but it refers to the time a system can be allowed to be down during a recovery. It assumes that a disaster has occurred, that emergency procedures are in effect, and that we intend to restore this system to normal operation. Shorter times are assigned to this measure for systems that are critical to our operations. The more critical a system is, the less time we can do without it.
Recovery Point Objective (RPO) - This one is harder. It is a measure of how current the most recent backup is, and how much data we can expect to have to load to that backup to become current again. The text expresses it as a number of work hours that will need to be captured and added to the most recent backup once it is restored. In the discussion on page 61, the text gives us another way to look at it. It calls the RPO the amount of data we can live without during the recovery process.
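The point about dependencies and the Recovery Time Objective can be combined in a small Python sketch. Everything in it is assumed for illustration: the dependency relationships and the RTO hours are hypothetical, not taken from the text. It uses the standard library's graphlib module (Python 3.9 or later) to put dependencies ahead of the services that need them, then tightens each dependency's RTO so it is never slower than a service relying on it.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each service maps to the services it needs.
depends_on = {
    "ISP service": set(),
    "Internet services": {"ISP service"},
    "Help desk": {"Internet services"},
}

# Hypothetical Recovery Time Objectives, in hours.
rto_hours = {"ISP service": 8, "Internet services": 4, "Help desk": 24}

# static_order() lists each dependency before the services that need it.
recovery_order = list(TopologicalSorter(depends_on).static_order())

# Walk from dependents back to dependencies: a dependency must be
# restored at least as quickly as anything that relies on it.
effective_rto = dict(rto_hours)
for service in reversed(recovery_order):
    for dependency in depends_on[service]:
        effective_rto[dependency] = min(
            effective_rto[dependency], effective_rto[service]
        )

print(recovery_order)
print(effective_rto)
```

In this made-up case, the ISP service was given 8 hours on its own merits, but it inherits the 4 hour objective of the Internet services that depend on it.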

In the small chart on page 61, the text shows us two curves plotted on the same graph. The Cost to Recover is highest if we have a constant live copy of the data, and lowest if we use an old-fashioned tape backup system. The old system looks good until we note that the time it takes to use it causes a much longer disruption time, which has associated costs that go higher the longer the disruption takes.

The simplified version of this chart, shown on the right, makes the same point, but may be easier to see. The projected cost to the organization in any scenario using these curves is the total cost of the red and the blue lines at any given point.

Reasonable expectations about tolerable downtime and recovery time lead to a compromise that the text shows as the Cost Balance Point. If we spend more on our recovery system, we can expect less time that the system will be down, lowering the costs that down time creates. You need to plot those two curves for your own organization to determine what the best choice is.
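A minimal sketch of finding that balance point follows, in Python. The recovery options, their costs, their downtimes, and the hourly cost of disruption are all hypothetical numbers chosen only to show the method; your own organization's figures would replace them.

```python
# Hypothetical cost of lost business per hour of downtime.
DISRUPTION_COST_PER_HOUR = 1_500

# Hypothetical options: (name, cost of the recovery system, hours of downtime).
options = [
    ("live mirror", 100_000, 1),
    ("hot site", 60_000, 8),
    ("nightly disk backup", 30_000, 24),
    ("tape backup", 10_000, 72),
]


def total_cost(recovery_cost, downtime_hours):
    """Total cost = cost to recover plus cost of the disruption."""
    return recovery_cost + downtime_hours * DISRUPTION_COST_PER_HOUR


costs = {name: total_cost(cost, hours) for name, cost, hours in options}

# The balance point is the option with the lowest combined cost.
best = min(costs, key=costs.get)
print(costs)
print("Lowest total cost:", best)
```

With these invented figures, neither the most expensive system nor the cheapest one wins; the lowest total falls in between, which is the compromise the chart is meant to show.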

The text returns to the idea of determining the priority of each process to the business, and each system or asset to the functions that use or depend on it. This is discussed again on page 62. At this point, you would think you would know a lot about the company's assets, functions, and priorities. The text turns to data collection. Eight data collection methods are listed on page 63, then discussed for several pages. Which one is best? The one that gets you to the truth. Keep that in mind. People often answer questions with either your agenda or their own in mind. Ask without pressure, and you may get better responses.

The last major topic in the chapter is budgeting. The text lists four operations that will require a budget. They are four familiar areas by now.

incident response - The text points out that this is a normal, expected IT operation. Not every incident is a wide-spread emergency, but they all need to be handled.
disaster recovery - The text proposes that the largest cost of recovery is insurance. The best recommendation we can make about it may be to consult industry associations to determine what are considered to be best practices.
business continuity - This one requires you to estimate and collect money for extra locations, employees, equipment, and data devices, to be used during emergencies.
crisis management - This item concerns large scale disasters that bring huge physical losses and long term psychological damage. It may also concern more predictable losses and expenses to employees, such as funeral costs. A lot depends on the benefit packages your employees may already have.
