NET 226 - Designing Internetwork Solutions
Chapter 2, Analyzing Technical Goals and Tradeoffs; Chapter
3, Characterizing the Existing Internetwork
Objectives:
This lesson discusses the technical goals of network design
and conflicting goals that lead to tradeoffs. Objectives important to
this lesson:
- Scalability
- Availability
- Performance
- Security
- Manageability and usability
- Adaptability and affordability
- Tradeoffs
- Characterizing the network infrastructure
- Checking the health of the existing internetwork
Concepts:
Chapter 2
Scalability
Scalability is the ability of a network design to adapt to growth. This can also mean adapting
to downsizing, but people
usually mean the ability to perform well when the organization adds users, locations, customers,
services, or other demands that require more from the network. The text
presents some standard questions to ask a client on page 26. They ask
the client to estimate the number of additional
sites, users, and servers that will be needed in the
next year, two years, and five years.
The text discusses an evolution in computer networks on page
26. It tells us that the computer business began with centralized data storage and
computing in the era of mainframes, and changed in the 1980s and 90s to
follow a decentralized model.
Decentralization began with improvements in processing power that let
users have powerful workstations and administrators have powerful
servers that could handle their needs locally. This led to the rule
that 80% of your traffic should stay on your LAN, and 20% of your
traffic may need to go to other LANs or more distant networks.
This was a good model, but it changed when the Internet
became popular, and people got the idea that you could reach any computer
anywhere in the world across a mythical superhighway of information. Myths
aside, companies decided to have fewer
data centers, and more connectivity
to them from remote sites. This has advantages in terms of fewer servers,
fewer administrators for them, and more consolidation of data. Note the
list of goals to be achieved in
this kind of scenario provided on page 27:
- Make separate LANs part of the corporate intranetwork
- Solve bottleneck problems caused by all users passing
traffic across LAN and WAN links
- Centralize data servers and centers (Note that this does
not mean they are geographically central.)
- If you are using mainframes, make them part of the IP
network to share data
- Support users in the field and regular telecommuting
- Support secure connections, when needed, with customers,
vendors, and business partners
Each of these goals is affected by the scale of the
network, and assumes that we can scale it up at any time.
Availability
What percentage of the time does our network need to be
available? The text offers a simple math problem on page 27 to begin
the discussion. Assume a business that does not close, such as an
online vendor. They are open 24 hours a day, 7 days a week. Multiply
those numbers and you should get 168 hours a week. How much downtime
can we afford to have? In the example in the text, the network is up
for 165 hours per week. That means that the network is up 98.214
percent of the time. It also means it is down for three hours a week.
All at once, or spread out a bit? It probably makes a difference. What
do we mean by up?
The text tries to clarify the meaning of availability, but
seems to muddle it more with other related words. Let's try a list,
showing them together with short definitions.
- availability - the
amount of time that a network is operational, compared to the amount of
time it is meant to be operational
- reliability - a
more general concept that is reduced by errors, equipment failures, and
inaccurate performance; a network might be available, but not reliable
- resiliency - the
ability to operate under stress or heavy loads; the ability to continue
to work or to be easily restored when there is a disaster
- redundancy - having
more than one way of doing something; redundancy can contribute to a
network being more available, reliable, or resilient, because it
anticipates something going wrong and provides another way to handle
the need
The text briefly discusses disaster
recovery on pages 28 and 29. We are advised to determine which
parts of our network are critical
to its being functional, and to determine how to work around the
failure of those parts in a disaster. As noted above, we need to decide
what we must have, what we can do without, and what we are going to do when the
parts we must have fail, cannot function, or are destroyed. The text
advises us to prioritize our needs, to establish backup devices and
data stores, and to test the solutions we choose to enable. Note that
this text is referring to the processes that take place during and after the disaster as disaster
recovery. Other texts we have seen break this into two parts: business continuity (while the
disaster is still taking place) and disaster
recovery (when we clean up and try to return to normal after the
disaster is over).
Returning to availability, the text again considers measuring
the percentage of time a system is available. Customers tend to assume
that a system will be available at all times. They need to be made
aware that some downtime should be expected, and how much. As noted
above, a customer may care about how long a system is down, but they
may care as much about when the downtime occurs and why it occurs.
- Does the expected downtime represent scheduled maintenance,
or does it represent the rate at which this system regularly fails? The
difference tells us whether the customers need to adjust their
expectations or the network administrators need to correct a problem.
- For a company that never closes, there is no good time for
scheduled maintenance, but there will still be times that are better
than others. Pick times when the system is not busy, or make a case for
more redundancy that would allow parts of the system to be off line for
the expected maintenance.
- Is the reported down time an average or a total that
misrepresents the behavior? If the system is down for a total of an
hour each month, is that a few seconds at a time or an hour all at once?
- What number and what measurement are most important to the
customer? We should specify both up time and down time, but we should
do so in units that mean something to the customer and that tell the
truth about the system's performance.
In the example on
page 29, the system in question has 99.70 percent up time. The text
ponders whether this is all at once or randomly distributed. Either
condition could be reported as an average of 30 minutes per week, or
10.70 seconds per hour of down time. The second measure does not sound
so bad, but they are only two ways of looking at the same measure,
neither of which may be telling us what we want to know. The customers
want to know how long they will have to wait when the system is down,
and how often this happens.
Some systems must have extremely low down times and extremely
high up times. The text introduces the idea of five nines up time on page 30.
This means the system must be up 99.999
percent of the time. The text explains that this is a down time
of about 5 minutes per year. (Just over 5.25 minutes per year,
actually.) The Wikipedia article on high availability compares this
level of service with lower and higher levels. We are cautioned
that five nines had better not include scheduled maintenance time, and
even then it may not be possible, unless we can do maintenance while
the system is running. This level of up time sounds desirable, but it
is not attainable without high levels of redundancy that the customer
may not be able to cover in a reasonable budget. The text suggests that
the level of redundancy required might be acceptable if the service is
provided to multiple clients at once using the same hardware. This is
what the text means by "collocation centers", locations at which the
hardware and software provide failover service to multiple clients.
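As a rough check on these figures, here is a small Python sketch (mine, not from the text) that converts an availability percentage into expected downtime over a chosen interval. It reproduces the numbers above: roughly 30 minutes per week at 99.70 percent, and a little over 5 minutes per year at five nines.

    # Convert an availability percentage into expected downtime for an interval.
    def downtime_seconds(availability_percent, interval_seconds):
        return (1 - availability_percent / 100.0) * interval_seconds

    HOUR = 3600
    WEEK = 7 * 24 * HOUR
    YEAR = 365 * 24 * HOUR

    print(downtime_seconds(99.70, WEEK) / 60)    # ~30.2 minutes per week
    print(downtime_seconds(99.70, HOUR))         # ~10.8 seconds per hour (the text's
                                                 # 10.7 comes from rounding to 30 min/week)
    print(downtime_seconds(99.999, YEAR) / 60)   # ~5.26 minutes per year (five nines)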
More new terms are used in the discussion that begins on the
bottom of page 31.
- MTBF - Mean Time
Between Failure is a statistical phrase that usually has to do with the
average number of hours a device can be expected to run before it
fails. (A mean
is what most people think of as an average.)
- MTTR - Mean Time To
Repair makes sense if you understand MTBF already. This is the average
time it takes to repair a device or service once it has failed.
- MTBSO - Mean Time
Between Service Outage is sometimes used instead of MTBF when we are
talking about the average time between service failures instead of
device failures.
The text explains on the next page that if we have values for
MTBF and MTTR, we can calculate availability as:
Availability =
MTBF / (MTBF + MTTR)
A problem with this concept is that it works well for a single
product, for which we can rely on statistical data from the
manufacturer or from an unbiased source such as Underwriters Laboratories.
A network is not a single product. It is a collection of many devices
and services, and we should not treat the data about any one of those
devices as a definitive measure of the network as a whole. You may not be
able to gather such data for an entire network, because the network will
change many times over the span of time required to collect it. This
approach is best used for individual devices and components with known values.
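To illustrate the formula, here is a short sketch; the MTBF and MTTR values are hypothetical, chosen only to show the arithmetic.

    # Availability from MTBF and MTTR, per the formula above.
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Hypothetical device: fails on average every 4,000 hours, takes 2 hours to repair.
    print(round(availability(4000, 2) * 100, 4))   # ~99.95 percent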
Performance
The author tells us that we should consider our discussion of
performance in the light of the plans for expansion that we gathered
from the customer in the first chapter. We should not put undue effort
into analyzing the performance of a system we are about to change, but
we should understand it before we change it, which will be addressed in
the next chapter.
On page 33, we see a list of terms relating to performance:
- Capacity - the
theoretical data-carrying capability of a circuit or network; may be
measured in bits per second (bps) or by some multiple (e.g. Mbps)
- Utilization - The
percentage of the capacity that is in use. The text recommends that the
best utilization to seek is 70%, which allows for bursts and peaks of
activity above that level. It also warns us that the expected
utilization for links between computers and switches is lower than for
links between switches, routers, servers, and other network
bottlenecks. Those links are expected to handle more traffic, so their
bandwidth should be designed to carry more traffic.
- Optimum utilization - the percentage of
utilization just below saturation
- Saturation - The
text does not define saturation, which is the state at which the
network or circuit can handle no additional traffic.
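To make the 70 percent guideline concrete, the sketch below (mine, not the text's) computes utilization from a measured traffic rate and a link's capacity, and flags a link that is running above the recommended optimum and approaching saturation.

    # Utilization = traffic carried / capacity, as a percentage.
    def utilization_percent(measured_bps, capacity_bps):
        return measured_bps / capacity_bps * 100

    # Hypothetical 100 Mbps link averaging 82 Mbps of traffic.
    u = utilization_percent(82_000_000, 100_000_000)
    print(u)          # 82.0
    print(u > 70)     # True: above the 70 percent optimum, nearing saturation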
- Throughput - The
quantity of error-free data
successfully transferred between nodes in a specific time interval.
Note the difference between this concept and capacity, listed above.
The text compares throughput to bandwidth on page 35, making the point
that we can put thousands of packets down a wire, but if only a few
hundred are usable, they are the only ones that count. Another aspect
that limits throughput is mentioned on that page. The access method
used in most networks requires that nodes wait when the network is very
busy.
The text discusses testing that is often done for internetworking
devices on page 36. When the test results are expressed in packets
per second, you should know that the tests are often done with small
packets and multiple streams through the device on multiple
ports. That is a normal function of the device, but it is not normal
for one user to push data in that manner. The numbers obtained in this
sort of test are artificially inflated, so the bottom line is that we
cannot rely on marketing material for testing. We should do our own,
based on the needs or characteristics of our network.
Throughput can also be measured in terms of data pushed by applications. The
problem with this, explained on page 37, is that the "throughput" being
measured can include overhead and retransmissions. This
is not what throughput is supposed to mean, but it is how it can be
measured, which borders on being dishonest about your device's stats. A
proper measurement would be to measure how long it takes to
push a large collection of data, without errors. This
is not to say that there cannot be errors or retransmissions. The text
means that we should, for example, determine how long it actually takes
to get a good duplicate of the data collection, not how long it
takes to push a specific number of packets or frames through the
network. On the bottom of page 37, there is a list of factors that
affect data transmission rates. Some, like frame size, can be modified,
but they may also be modified dynamically by network protocols as
conditions on the network change.
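One way to measure throughput in the spirit the text describes is to time how long it takes to deliver a known amount of good data, and divide the useful bits by the elapsed time, so that overhead and retransmissions are excluded from the count. A minimal sketch, assuming the caller supplies a transfer routine that copies and verifies the data:

    import time

    # Application-layer throughput: useful bits delivered per second.
    # Only the verified copy counts; protocol overhead and retransmitted
    # bytes are not added to the numerator.
    def measure_throughput_bps(transfer_and_verify, useful_bytes):
        start = time.monotonic()
        transfer_and_verify()              # e.g., copy a test file and check its checksum
        elapsed = time.monotonic() - start
        return useful_bytes * 8 / elapsed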
- Offered load - the total of all the bits
that all the nodes on the network are ready to transmit. This
will, of course, vary from one moment to another.
- Accuracy - The percentage of transmissions that
are sent and received correctly. We expect that this
will be less than 100%, but hope it will be close to 100%. Accuracy is
discussed on page 38, where we are told that WAN links are commonly
measured by a bit error rate (BER). It is expressed as 1
error per some number of bits. The text offers three WAN related
statistics:
- analog WAN - 1 in 10^5 (one in a hundred thousand)
- digital WAN over copper - 1 in 10^6 (one in a million)
- digital WAN over fiber - 1 in 10^7 (one in ten million)
- The text also tells us that LANs are not usually
measured this way, because LANs use frames instead of packets. It
recommends that we find the number of bad frames in a given series of
bits, convert it to errors per million bits, and see if it exceeds the
standard for digital copper above.
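The conversion the text recommends can be done as follows; the frame and bit counts are invented for the example.

    # Convert observed bad frames over a known number of bits into errors
    # per million bits, then compare to the 1-in-10^6 benchmark for digital
    # copper WAN links mentioned above.
    def errors_per_million_bits(bad_frames, total_bits):
        return bad_frames / total_bits * 1_000_000

    rate = errors_per_million_bits(bad_frames=3, total_bits=2_000_000_000)
    print(rate)        # 0.0015 errors per million bits
    print(rate > 1.0)  # False: well under the digital-copper benchmark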
- Collisions
- Ethernets suffer from frame collisions when two nodes try to transmit
at the same time over a shared line. The text gives us some new terms
and troubleshooting advice.
Frames have 8-byte preamble
sections that are often the parts of the frames that collide. This kind
of collision is not tracked by
troubleshooting tools, probably because this is the way an Ethernet is
supposed to work.
When the collision occurs after the preamble, but still in the
first 64 bytes of the frame, we can call this a legal collision. The frame that
collides in this way is called a runt
frame. This seems to be because only a portion of the frame made
it through the network. Less than 0.1 percent of frames should be in this kind of collision.
If the collision takes place after the first 64 bytes of a frame, it is
called a late collision, which should "never happen". When it does, it
may be caused by the network being too large, by a faulty (slow)
repeater, or by one or more bad NICs.
When a station is in full duplex mode, it should not experience collisions
either, but chapter 3 will examine this concept. The text suggests that if
collisions do occur on such a link, we should look for a duplex mismatch.
A mismatch can happen when autonegotiation fails, or when someone configures
one card for half duplex and the other for full.
- Efficiency - the text discusses the idea that there
are harder and easier ways to send data, and efficiency is a measure of
how hard
our network works to send and receive data. We should look for too many
collisions happening, which will cause more retransmissions than should
be necessary.
Another problem is depicted in the illustration on page 40. Each frame includes a header, and each frame is trailed by a gap between it and the next frame. Headers and gaps are not
data, so the more small frames we use, the less efficient our network
is. Fewer, larger frames mean fewer headers and fewer gaps. It also
means more likely collisions, so we need to seek the best tradeoff in
frame size.
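To see the tradeoff in numbers, the sketch below compares per-frame efficiency for small and large payloads. The overhead figure assumes standard Ethernet framing (8-byte preamble, 14-byte header, 4-byte frame check sequence, and a 12-byte minimum interframe gap); those specific values are my assumption, not something stated on page 40.

    # Per-frame efficiency = payload bytes / (payload bytes + fixed overhead).
    # Assumed Ethernet overhead: preamble 8 + header 14 + FCS 4 + interframe gap 12.
    OVERHEAD_BYTES = 8 + 14 + 4 + 12

    def efficiency(payload_bytes):
        return payload_bytes / (payload_bytes + OVERHEAD_BYTES)

    print(round(efficiency(46), 3))     # ~0.548 for a minimum-size payload
    print(round(efficiency(1500), 3))   # ~0.975 for a full-size payload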
- Delay (latency) - the time between a "ready
to send" and "received"
- Delay variation - the amount of variance in
delay times on a network; The text tells us that this is called jitter,
but this word is also used to describe the actual delay. Jitter is not
noticeable in a standard file transfer, but is very noticeable in a
live stream of video or audio. The basic standard for wireless
communications is to keep jitter less than 5 milliseconds.
The author gives us some background in physics on page 41. We should
remember that all signals, whether wired or wireless, take some amount
of time to travel from one point to another. This is propagation delay. She gives us two standard measures for the speed of light through a vacuum, and reminds us that the speed of light (or electrons) through copper or fiber is about two thirds
of that standard. Two rules of thumb are offered. Figure 1 nanosecond
of delay for each foot of copper wire or fiber. Figure 1 millisecond of
delay for every 200 kilometers. (That is a little high, but it is a
usable approximation. Try the math, then explain the answer you get.)
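Taking the author up on "try the math," the sketch below assumes signals travel at about two thirds of the speed of light in a vacuum, roughly 2 x 10^8 meters per second. The 200 km rule works out almost exactly; the 1 nanosecond per foot figure is closer to the vacuum speed of light, and in copper or fiber it is nearer 1.5 nanoseconds per foot.

    # Propagation delay = distance / propagation speed.
    SPEED_MPS = 2.0e8   # assumed: ~2/3 the speed of light, for copper or fiber

    def propagation_delay_s(distance_m):
        return distance_m / SPEED_MPS

    print(propagation_delay_s(200_000) * 1e3)   # ~1.0 ms per 200 km
    print(propagation_delay_s(0.3048) * 1e9)    # ~1.52 ns per foot (the 1 ns rule
                                                # matches the vacuum speed of light)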
The text describes serialization delay,
the time it takes to clock a frame's bits onto a link, which depends on
the amount of data and the bandwidth of the link. The text gives us the example of a T1 line, which has
a bandwidth of 1.544 Mbps, carrying a 1 KB file. That would take about
5 ms, which you may imagine as pouring a glass of water through a funnel.
Not so bad until you realize you need to pour a tanker truck full of
water through that funnel.
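The T1 arithmetic works out as shown below (a sketch, not from the text): 1 KB is 8,192 bits, and 8,192 bits divided by 1.544 Mbps is about 5.3 ms.

    # Serialization delay = bits to transmit / link bandwidth.
    def serialization_delay_ms(num_bytes, link_bps):
        return num_bytes * 8 / link_bps * 1000

    print(round(serialization_delay_ms(1024, 1_544_000), 1))   # ~5.3 ms for 1 KB on a T1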
The text spends some space on packet-switching delay, which it explains
as the time it takes all the routers and switches along the route to
receive, store, process, and forward a packet. There are several
factors that affect this delay, including the type of RAM in the
device, the processor speed, and the number of choices the device must
select from.
Another concern in this section is queuing
delay. For those who may not know, a queue
is a line in which people or packets wait. The text warns us that increases
in network utilization increase the number of packets in queues exponentially.
See the figure on page 42. This gets more serious as the utilization
increases. The text recommends increasing the bandwidth of WAN circuits,
or using queuing algorithms that can prioritize packets that need to
be delivered faster.
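The figure on page 42 shows queue depth climbing steeply as utilization rises. A common way to approximate that curve, and my assumption about what the figure depicts, is the single-server queuing formula: average queue depth = utilization / (1 - utilization).

    # Approximate average queue depth versus utilization (classic M/M/1 result).
    # This is an assumed model of the curve on page 42, not a quote from it.
    def avg_queue_depth(utilization):
        return utilization / (1 - utilization)

    for u in (0.5, 0.7, 0.9, 0.95):
        print(u, round(avg_queue_depth(u), 1))
    # 0.5 -> 1.0   0.7 -> 2.3   0.9 -> 9.0   0.95 -> 19.0: depth explodes near saturation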
- Response time - The amount of time between making a request
and receiving a response
to the request. The text cautions us that users will complain if
response time goes over 100 ms, and that TCP uses this time limit as a
cutoff when waiting to retransmit a packet. With this in mind,
programmers and web developers should warn users when response times
will be higher than this threshold.
Security
The text spends a few pages on a subject we cover in several classes.
In the context of this chapter, security measures add to a network's cost,
but so do breaches. They slow worker productivity, but they make productivity
possible by protecting us from attacks. We should follow the basic plan
you should know by now: identify assets, analyze the risks, and develop
a security plan.
Manageability and usability
The text advises us to make sure we determine how our customer wants
to manage the network, and to make hardware and software decisions that
support these goals. This is sensible for customers who know how to manage
a network, or who have staff on hand who can make reasonable requests. This
is not sensible if the customer has no background in network management,
or has no preferences.
The text tells us on the next page that we should also be concerned with
making the network easier for employees to use. Increasing usability is
not necessarily at odds with increasing manageability, but the text wants
to make sure we know that these two concepts serve two different parts
of your customer's employee population.
Adaptability and affordability
Another design goal that is given a short treatment is adaptability.
It is hard to predict the future, but the main idea is to choose technology
and equipment that will not tie you down to one vendor in the future,
or to one set of proprietary options. Networks change. They grow and they
shrink, they use new protocols and new hardware, but choices that are
compliant with industry standards will give you more ability to adapt
to the next change in the future. The text points out that it is more
common now than a few years ago to promote working remotely and working
from home. This does not mean we have to redo the entire network, but
it does mean we have to think about remote security, about VPN connections,
and about creating or increasing the ability of our network to allow remote
access.
Affordability is simply making the right financial decisions for your
customer. What those decisions are will depend a lot on what their network
needs to do, but there is good advice on pages 51 and 52 about buying things that
work with each other, things that are easy to manage and configure, and
things that have the capacity to handle more traffic and more users than
you currently have.
Tradeoffs
The text offers an interesting idea for a conversation with your client.
Since you are talking to him/her about all the topics in this chapter,
you should also make an effort to explain each topic, and to get the client
to prioritize each of them. This will allow you to consider which side
of an issue to emphasize when making choices about tradeoffs. The text
suggests that you assign a percentage to each concept, with the requirement
that they add up to 100. Consider this as a way to allocate the budget
for the project, or to determine which of the customer's goals are most
important.
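As a concrete, entirely hypothetical illustration of the exercise, a client's weights might look like the following; the only hard requirement from the text is that they add up to 100.

    # Hypothetical client priorities for the design goals in this chapter.
    goals = {
        "scalability": 20,
        "availability": 30,
        "performance": 15,
        "security": 5,
        "manageability": 5,
        "usability": 5,
        "adaptability": 5,
        "affordability": 15,
    }
    assert sum(goals.values()) == 100   # the text's requirement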
The chapter ends with another checklist that summarizes the information
you should gather regarding the topics this chapter presented.
Chapter 3
Characterizing the network infrastructure
The
chapter begins with a quote from Abraham Lincoln which is meant to advise
us that we need to know where we are and where we want to be in order
to make choices about how to reach our goals. That being said, the point
of this chapter is to determine the current state of a network.
The text advises us to create a set of network maps
that show the locations of all major network components, segments, and
their names and addresses. We should compare the maps we can make with
data on network usage and performance to see where the network is
stressed, and where it is working well. The text goes into more detail
about this on page 60.
The author informs us that we could start mapping by creating maps of each location in a large network, but she seems to prefer an expanding map that supports the top-down concept.
- We start at a high level, creating a map that shows a general schematic of sites and WAN links.
- Each location in the high level map is then represented
with its own map, with the next logical level of detail, perhaps at the
level of Metropolitan Area Networks (MANs).
- If the second level of detail was about MANs, then the next level of detail should show the LANs in each MAN.
- If we have just shown the LANs, we then break down each LAN, showing its components and structure.
- In
sufficiently large or complex LANs, we may want maps of each floor of
each building. This would be helpful for staff who are installing or
moving devices.
The
text describes making maps of services on the network, and lists
several common network services on page 61. It is a good idea to be
aware of the services that are on a network, and those that the project
requires that we add to the network. Making a map of such services may
not be the most useful way to track them, because users tend to move
about with laptops, tablets, and other portable devices. They require
the services they need wherever they might be, so a map may not capture
the fact that the services are needed everywhere.
Skipping ahead to an example, the author shows us high level
diagrams of enterprise networks on pages 63 and 64. You can draw charts like
these with a service available through your email account. Sign on to your Baker email account, then click the Google Apps icon. Click More at the bottom of the list, and select the orange icon for Lucidchart Diagrams. You can make many kinds of charts, including network diagrams with Cisco symbols.
Back
to the text, the diagram on page 63 is a schematic that shows the
network connections to several locations from a central office in
Grants Pass, Oregon. It may be useful to look at these locations on a
map so we can recognize that the diagram is not drawn geographically.
It is drawn to show the equipment being used, and the connections to
each of the central and distant locations. In a set of top-down
diagrams, each of these locations would have another diagram of its
own, which would show the structure and services at that site.
The diagram on page 64 is a little harder to read. Each block in the diagram represents a function or service. Inside the block we see icons for the kind
of equipment providing the service, but we are not seeing how many
devices might actually be installed. We also see connecting lines
representing communication media, but none of the lines show bandwidth
details that were shown in the diagram on page 63.
The next several pages present more topics you should document about the network or plan in question.
- Addressing and naming
- The text makes some suggestions about naming standards to reflect the
location, device type, and/or service the device provides. IP
addressing is almost universal, so a logical addressing scheme and
method should be chosen that will allow scaling and subdividing as
needed. The author promises more about this in a later chapter.
- Wiring and network media
- The type and grade of cable used inside and between buildings should
be documented. It may be helpful to know the terms listed in the text: vertical wiring runs from one floor to another, horizontal wiring runs from a wiring closet to a wallplate (which may be in the floor, or under it), and work-area wiring
runs from a wall plate to a host you are connecting to a network. The
author makes an odd observation about most wiring being assumed to be
less than 100 meters long. The general rule about Unshielded Twisted
Pair wiring is that it doesn't work if the total run from a host to a
network connectivity device is over 100 meters.
- Architectural and Environmental
constraints - It may not be possible to run network cable through an
area that your customer does not own. It is also possible that you
cannot run a cable if the site is protected by local laws, such as
being a historical site. These issues may lead to considering wireless
solutions for part of your network. The text also addresses supportive
services that your new or expanded network will require, such as air
conditioning, heating, ventilation, power, protection from EMI, and a
secure space for all the equipment.
- Wireless concerns -
A wireless network solves some problems but adds concerns that a wired
network does not have. The author summarizes several of them on pages
69 and 70. Note that a wireless signal will fade
over distance, as will a signal in a wire, but at a much faster rate.
If the signal encounters anything, the signal can be affected in
several different ways.
Checking the health of the existing internetwork
The purpose of this section is to take baseline measurements of the existing
network, so that you can tell whether your changes to the network introduce
improvements or problems. The text cautions us that we must also make
sure that improved performance is one of the customer's goals. If the
main objective is to reduce costs, a performance hit may be acceptable
if it is not too large.
The text offers several pages of ideas for measuring the state of the
network, but not a lot of detail on how to do most of it. We can get a
lot of the ideas by going through the Network Health Checklist on pages
83 and 84. Go over that list and we will discuss it on this week's discussion
board.