NET 226 - Designing Internetwork Solutions
Chapter 2, Analyzing Technical Goals and Tradeoffs; Chapter
3, Characterizing the Existing Internetwork
Objectives:
This lesson discusses the technical goals of network design
and conflicting goals that lead to tradeoffs. Objectives important to
this lesson:
- Scalability
- Availability
- Performance
- Security
- Manageability and usability
- Adaptability and affordability
- Tradeoffs
- Characterizing the network infrastructure
- Checking the health of the existing internetwork
Concepts:
Chapter 2
Scalability
Scalability is the ability of a network design to adapt to growth. This can also mean adapting
to downsizing, but people
usually mean the ability to perform well when the organization adds users, locations, customers,
services, or other demands that require more from the network. The text
presents some standard questions to ask a client on page 26. They ask
the client to estimate the number of additional
sites, users, and servers that will be needed in the
next year, two years, and five years.
The text discusses an evolution in computer networks on page
26. It tells us that the computer business began with centralized data storage and
computing in the era of mainframes, and changed in the 1980s and 90s to
follow a decentralized model.
Decentralization began with improvements in processing power that let
users have powerful workstations and administrators have powerful
servers that could handle their needs locally. This led to the rule
that 80% of your traffic should stay on your LAN, and 20% of your
traffic may need to go to other LANs or more distant networks.
This was a good model, but it changed when the Internet
became popular, and people got the idea that you could reach any computer
anywhere in the world across a mythical superhighway of information. Myths
aside, companies decided to have fewer
data centers, and more connectivity
to them from remote sites. This has advantages in terms of fewer servers,
fewer administrators for them, and more consolidation of data. Note the
list of goals to be achieved in
this kind of scenario provided on page 27:
- Make separate LANs part of the corporate intranetwork
- Solve bottleneck problems caused by all users passing
traffic across LAN and WAN links
- Centralize data servers and centers (Note that this does
not mean they are geographically central.)
- If you are using mainframes, make them part of the IP
network to share data
- Support users in the field and regular telecommuting
- Support secure connections, when needed, with customers,
vendors, and business partners
Each of these goals is affected by the scale of the
network, and assumes that we can scale it up at any time.
Availability
What percentage of the time does our network need to be
available? The text offers a simple math problem on page 27 to begin
the discussion. Assume a business that does not close, such as an
online vendor. They are open 24 hours a day, 7 days a week. Multiply
those numbers and you should get 168 hours a week. How much downtime
can we afford to have? In the example in the text, the network is up
for 165 hours per week. That means that the network is up 98.214
percent of the time. It also means it is down for three hours a week.
All at once, or spread out a bit? It probably makes a difference. What
do we mean by up?
The text tries to clarify the meaning of availability, but
seems to muddle it more with other related words. Let's try a list,
showing them together with short definitions.
- availability - the
amount of time that a network is operational, compared to the amount of
time it is meant to be operational
- reliability - a
more general concept that is reduced by errors, equipment failures, and
inaccurate performance; a network might be available, but not reliable
- resiliency - the
ability to operate under stress or heavy loads; the ability to continue
to work or to be easily restored when there is a disaster
- redundancy - having
more than one way of doing something; redundancy can contribute to a
network being more available, reliable, or resilient, because it
anticipates something going wrong and provides another way to handle
the need
The text briefly discusses disaster
recovery on pages 28 and 29. We are advised to determine which
parts of our network are critical
to its being functional, and to determine how to work around the
failure of those parts in a disaster. As noted above, we need to decide
what we must have, what we can do without, and what we are going to do when the
parts we must have fail, cannot function, or are destroyed. The text
advises us to prioritize our needs, to establish backup devices and
data stores, and to test the solutions we choose to enable. Note that
this text is referring to the processes that take place during and after the disaster as disaster
recovery. Other texts we have seen break this into two parts: business continuity (while the
disaster is still taking place) and disaster
recovery (when we clean up and try to return to normal after the
disaster is over).
Returning to availability, the text again considers measuring
the percentage of time a system is available. Customers tend to assume
that a system will be available at all times. They need to be made
aware that some downtime should be expected, and how much. As noted
above, a customer may care about how long a system is down, but they
may care as much about when the downtime occurs and why it occurs.
- Does the expected downtime represent scheduled maintenance,
or does it represent the rate at which this system regularly fails? The
difference tells us whether the customers need to adjust their
expectations or the network administrators need to correct a problem.
- For a company that never closes, there is no good time for
scheduled maintenance, but there will still be times that are better
than others. Pick times when the system is not busy, or make a case for
more redundancy that would allow parts of the system to be off line for
the expected maintenance.
- Is the reported down time an average or a total that
misrepresents the behavior? If the system is down for a total of an
hour each month, is that a few seconds at a time or an hour all at once?
- What number and what measurement are most important to the
customer? We should specify both up time and down time, but we should
do so in units that mean something to the customer and that tell the
truth about the system's performance.
In the example on
page 29, the system in question has 99.70 percent up time. The text
ponders whether this is all at once or randomly distributed. Either
condition could be reported as an average of 30 minutes per week, or
10.70 seconds per hour of down time. The second measure does not sound
so bad, but they are only two ways of looking at the same measure,
neither of which may be telling us what we want to know. The customers
want to know how long they will have to wait when the system is down,
and how often this happens.
Some systems must have extremely low down times and extremely
high up times. The text introduces the idea of five nines up time on page 30.
This means the system must be up 99.999
percent of the time. The text explains that this is a down time
of about 5 minutes per year. (Just over 5.25 minutes per year,
actually.) The Wikipedia article on high availability compares this
level of service with lower and higher levels. We are cautioned
that five nines had better not include scheduled maintenance time, and
even then it may not be possible, unless we can do maintenance while
the system is running. This level of up time sounds desirable, but it
is not attainable without high levels of redundancy that the customer
may not be able to cover in a reasonable budget. The text suggests that
the level of redundancy required might be acceptable if the service is
provided to multiple clients at once using the same hardware. This is
what the text means by "collocation centers", locations at which the
hardware and software provide failover service to multiple clients.
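As a rough check on these figures, here is a small Python sketch (mine, not from the text) that converts an availability percentage into expected downtime over a chosen interval. It reproduces the numbers above: roughly 30 minutes per week at 99.70 percent, and a little over 5 minutes per year at five nines.

    # Convert an availability percentage into expected downtime for an interval.
    def downtime_seconds(availability_percent, interval_seconds):
        return (1 - availability_percent / 100.0) * interval_seconds

    HOUR = 3600
    WEEK = 7 * 24 * HOUR
    YEAR = 365 * 24 * HOUR

    print(downtime_seconds(99.70, WEEK) / 60)    # ~30.2 minutes per week
    print(downtime_seconds(99.70, HOUR))         # ~10.8 seconds per hour (the text's
                                                 # 10.7 comes from rounding to 30 min/week)
    print(downtime_seconds(99.999, YEAR) / 60)   # ~5.26 minutes per year (five nines)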
More new terms are used in the discussion that begins on the
bottom of page 31.
- MTBF - Mean Time
Between Failure is a statistical phrase that usually has to do with the
average number of hours a device can be expected to run before it
fails. (A mean
is what most people think of as an average.)
- MTTR - Mean Time To
Repair makes sense if you understand MTBF already. This is the average
time it takes to repair a device or service once it has failed.
- MTBSO - Mean Time
Between Service Outage is sometimes used instead of MTBF when we are
talking about the average time between service failures instead of
device failures.
The text explains on the next page that if we have values for
MTBF and MTTR, we can calculate availability as:
Availability =
MTBF / (MTBF + MTTR)
A problem with this concept is that it works well for a single
product, for which we can rely on statistical data from the
manufacturer or from an unbiased source such as Underwriters Laboratories.
A network is not a single product. It is a collection of many devices
and services, and we should not treat the data about any one of those
devices as a definitive measure of the network as a whole. You may not be
able to gather such data for an entire network, because the network will
change many times over the span of time required to collect it. This
approach is best used for individual devices and components with known values.
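To illustrate the formula, here is a short sketch; the MTBF and MTTR values are hypothetical, chosen only to show the arithmetic.

    # Availability from MTBF and MTTR, per the formula above.
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Hypothetical device: fails on average every 4,000 hours, takes 2 hours to repair.
    print(round(availability(4000, 2) * 100, 4))   # ~99.95 percent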
Performance
The author tells us that we should consider our discussion of
performance in the light of the plans for expansion that we gathered
from the customer in the first chapter. We should not put undue effort
into analyzing the performance of a system we are about to change, but
we should understand it before we change it, which will be addressed in
the next chapter.
On page 33, we see a list of terms relating to performance:
- Capacity - the
theoretical data-carrying capability of a circuit or network; may be
measured in bits per second (bps) or by some multiple (e.g. Mbps)
- Utilization - The
percentage of the capacity that is in use. The text recommends that the
best utilization to seek is 70%, which allows for bursts and peaks of
activity above that level. It also warns us that the expected
utilization for links between computers and switches is lower than for
links between switches, routers, servers, and other network
bottlenecks. Those links are expected to handle more traffic, so their
bandwidth should be designed to carry more traffic.
- Optimum utilization - the percentage of
utilization just below saturation
- Saturation - The
text does not define saturation, which is the state at which the
network or circuit can handle no additional traffic.
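To make the 70 percent guideline concrete, the sketch below (mine, not the text's) computes utilization from a measured traffic rate and a link's capacity, and flags a link that is running above the recommended optimum and approaching saturation.

    # Utilization = traffic carried / capacity, as a percentage.
    def utilization_percent(measured_bps, capacity_bps):
        return measured_bps / capacity_bps * 100

    # Hypothetical 100 Mbps link averaging 82 Mbps of traffic.
    u = utilization_percent(82_000_000, 100_000_000)
    print(u)          # 82.0
    print(u > 70)     # True: above the 70 percent optimum, nearing saturation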
- Throughput - The
quantity of error-free data
successfully transferred between nodes in a specific time interval.
Note the difference between this concept and capacity, listed above.
The text compares throughput to bandwidth on page 35, making the point
that we can put thousands of packets down a wire, but if only a few
hundred are usable, they are the only ones that count. Another aspect
that limits throughput is mentioned on that page. The access method
used in most networks requires that nodes wait when the network is very
busy.
The text discusses testing that is often done for internetworking
devices on page 36. When the test results are expressed in packets
per second, you should know that the tests are often done with small
packets and multiple streams through the device on multiple
ports. That is a normal function of the device, but it is not normal
for one user to push data in that manner. The numbers obtained in this
sort of test are artificially inflated, so the bottom line is that we
cannot rely on marketing material for testing. We should do our own,
based on the needs or characteristics of our network.
Throughput can also be measured in terms of data pushed by applications. The
problem with this, explained on page 37, is that the "throughput" being
measured can include overhead and retransmissions. This
is not what throughput is supposed to mean, but it is how it can be
measured, which borders on being dishonest about your device's stats. A
proper measurement would be to measure how long it takes to
push a large collection of data, without errors. This
is not to say that there cannot be errors or retransmissions. The text
means that we should, for example, determine how long it actually takes
to get a good duplicate of the data collection, not how long it
takes to push a specific number of packets or frames through the
network. On the bottom of page 37, there is a list of factors that
affect data transmission rates. Some, like frame size, can be modified,
but they may also be modified dynamically by network protocols as
conditions on the network change.
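One way to measure throughput in the spirit the text describes is to time how long it takes to deliver a known amount of good data, and divide the useful bits by the elapsed time, so that overhead and retransmissions are excluded from the count. A minimal sketch, assuming the caller supplies a transfer routine that copies and verifies the data:

    import time

    # Application-layer throughput: useful bits delivered per second.
    # Only the verified copy counts; protocol overhead and retransmitted
    # bytes are not added to the numerator.
    def measure_throughput_bps(transfer_and_verify, useful_bytes):
        start = time.monotonic()
        transfer_and_verify()              # e.g., copy a test file and check its checksum
        elapsed = time.monotonic() - start
        return useful_bytes * 8 / elapsed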
- Offered load - the total of all the bits
that all the nodes on the network are ready to transmit. This
will, of course, vary from one moment to another.
- Accuracy - The percentage of transmissions that
are sent and received correctly. We expect that this
will be less than 100%, but hope it will be close to 100%. Accuracy is
discussed on page 38, where we are told that WAN links are commonly
measured by a bit error rate (BER). It is expressed as 1
error per some number of bits. The text offers three WAN related
statistics:
- analog WAN - 1 in 10^5 (one in a hundred thousand)
- digital WAN over copper - 1 in 10^6 (one in a million)
- digital WAN over fiber - 1 in 10^7 (one in ten million)
- The text also tells us that LANs are not usually
measured this way, because LANs use frames instead of packets. It
recommends that we find the number of bad frames in a given series of
bits, convert it to errors per million bits, and see if it exceeds the
standard for digital copper above.
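The conversion the text recommends can be done as follows; the frame and bit counts are invented for the example.

    # Convert observed bad frames over a known number of bits into errors
    # per million bits, then compare to the 1-in-10^6 benchmark for digital
    # copper WAN links mentioned above.
    def errors_per_million_bits(bad_frames, total_bits):
        return bad_frames / total_bits * 1_000_000

    rate = errors_per_million_bits(bad_frames=3, total_bits=2_000_000_000)
    print(rate)        # 0.0015 errors per million bits
    print(rate > 1.0)  # False: well under the digital-copper benchmark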
- Collisions
- Ethernets suffer from frame collisions when two nodes try to transmit
at the same time over a shared line. The text gives us some new terms
and troubleshooting advice.
Frames have 8-byte preamble
sections that are often the parts of the frames that collide. This kind
of collision is not tracked by
troubleshooting tools, probably because this is the way an Ethernet is
supposed to work.
When the collision occurs after the preamble, but still in the
first 64 bytes of the frame, we can call this a legal collision. The frame that
collides in this way is called a runt
frame. This seems to be because only a portion of the frame made
it through the network. Less than 0.1 percent of frames should be in this kind of collision.
If the collision takes place after the first 64 bytes of a frame, it is
called a late collision, which should "never happen". When it does, it
may be caused by the network being too large, by a faulty (slow)
repeater, or by one or more bad NICs.
When a station is in full duplex mode, it should not experience collisions
either, but chapter 3 will examine this concept. The text suggests that if
collisions do occur on such a link, we should look for a duplex mismatch.
A mismatch can happen when autonegotiation fails, or when someone configures
one card for half duplex and the other for full.
- Efficiency - the text discusses the idea that there
are harder and easier ways to send data, and efficiency is a measure of
how hard
our network works to send and receive data. We should look for too many
collisions happening, which will cause more retransmissions than should
be necessary.
Another problem is depicted in the illustration on page 40. Each frame includes a header, and each frame is trailed by a gap between it and the next frame. Headers and gaps are not
data, so the more small frames we use, the less efficient our network
is. Fewer, larger frames mean fewer headers and fewer gaps. It also
means more likely collisions, so we need to seek the best tradeoff in
frame size.
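To see the tradeoff in numbers, the sketch below compares per-frame efficiency for small and large payloads. The overhead figure assumes standard Ethernet framing (8-byte preamble, 14-byte header, 4-byte frame check sequence, and a 12-byte minimum interframe gap); those specific values are my assumption, not something stated on page 40.

    # Per-frame efficiency = payload bytes / (payload bytes + fixed overhead).
    # Assumed Ethernet overhead: preamble 8 + header 14 + FCS 4 + interframe gap 12.
    OVERHEAD_BYTES = 8 + 14 + 4 + 12

    def efficiency(payload_bytes):
        return payload_bytes / (payload_bytes + OVERHEAD_BYTES)

    print(round(efficiency(46), 3))     # ~0.548 for a minimum-size payload
    print(round(efficiency(1500), 3))   # ~0.975 for a full-size payload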
- Delay (latency) - the time between a "ready
to send" and "received"
- Delay variation - the amount of variance in
delay times on a network; The text tells us that this is called jitter,
but this word is also used to describe the actual delay. Jitter is not
noticeable in a standard file transfer, but is very noticeable in a
live stream of video or audio. The basic standard for wireless
communications is to keep jitter less than 5 milliseconds.
The author gives us some background in physics on page 41. We should
remember that all signals, whether wired or wireless, take some amount
of time to travel from one point to another. This is propagation delay. She gives us two standard measures for the speed of light through a vacuum, and reminds us that the speed of light (or electrons) through copper or fiber is about two thirds
of that standard. Two rules of thumb are offered. Figure 1 nanosecond
of delay for each foot of copper wire or fiber. Figure 1 millisecond of
delay for every 200 kilometers. (That is a little high, but it is a
usable approximation. Try the math, then explain the answer you get.)
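Taking the author up on "try the math," the sketch below assumes signals travel at about two thirds of the speed of light in a vacuum, roughly 2 x 10^8 meters per second. The 200 km rule works out almost exactly; the 1 nanosecond per foot figure is closer to the vacuum speed of light, and in copper or fiber it is nearer 1.5 nanoseconds per foot.

    # Propagation delay = distance / propagation speed.
    SPEED_MPS = 2.0e8   # assumed: ~2/3 the speed of light, for copper or fiber

    def propagation_delay_s(distance_m):
        return distance_m / SPEED_MPS

    print(propagation_delay_s(200_000) * 1e3)   # ~1.0 ms per 200 km
    print(propagation_delay_s(0.3048) * 1e9)    # ~1.52 ns per foot (the 1 ns rule
                                                # matches the vacuum speed of light)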
The text describes serialization delay,
the time it takes to clock a frame's bits onto a link, which depends on
the amount of data and the bandwidth of the link. The text gives us the example of a T1 line, which has
a bandwidth of 1.544 Mbps, carrying a 1 KB file. That would take about
5 ms, which you may imagine as pouring a glass of water through a funnel.
Not so bad until you realize you need to pour a tanker truck full of
water through that funnel.
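The T1 arithmetic works out as shown below (a sketch, not from the text): 1 KB is 8,192 bits, and 8,192 bits divided by 1.544 Mbps is about 5.3 ms.

    # Serialization delay = bits to transmit / link bandwidth.
    def serialization_delay_ms(num_bytes, link_bps):
        return num_bytes * 8 / link_bps * 1000

    print(round(serialization_delay_ms(1024, 1_544_000), 1))   # ~5.3 ms for 1 KB on a T1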
The text spends some space on packet-switching delay, which it explains
as the time it takes all the routers and switches along the route to
receive, store, process, and forward a packet. There are several
factors that affect this delay, including the type of RAM in the
device, the processor speed, and the number of choices the device must
select from.
Another concern in this section is queuing
delay. For those who may not know, a queue
is a line in which people or packets wait. The text warns us that increases
in network utilization increase the number of packets in queues exponentially.
See the figure on page 42. This gets more serious as the utilization
increases. The text recommends increasing the bandwidth of WAN circuits,
or using queuing algorithms that can prioritize packets that need to
be delivered faster.
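The figure on page 42 shows queue depth climbing steeply as utilization rises. A common way to approximate that curve, and my assumption about what the figure depicts, is the single-server queuing formula: average queue depth = utilization / (1 - utilization).

    # Approximate average queue depth versus utilization (classic M/M/1 result).
    # This is an assumed model of the curve on page 42, not a quote from it.
    def avg_queue_depth(utilization):
        return utilization / (1 - utilization)

    for u in (0.5, 0.7, 0.9, 0.95):
        print(u, round(avg_queue_depth(u), 1))
    # 0.5 -> 1.0   0.7 -> 2.3   0.9 -> 9.0   0.95 -> 19.0: depth explodes near saturation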
- Response time - The amount of time between making a request
and receiving a response
to the request. The text cautions us that users will complain if
response time goes over 100 ms, and that TCP uses this time limit as a
cutoff when waiting to retransmit a packet. With this in mind,
programmers and web developers should warn users when response times
will be higher than this threshold.
Security
The text spends a few pages on a subject we cover in several classes.
In the context of this chapter, security measures add to a network's cost,
but so do breaches. They slow worker productivity, but they make productivity
possible by protecting us from attacks. We should follow the basic plan
you should know by now: identify assets, analyze the risks, and develop
a security plan.
Manageability and usability
The text advises us to make sure we determine how our customer wants
to manage the network, and to make hardware and software decisions that
support these goals. This is sensible for customers who know how to manage
a network, or who have staff on hand who can make reasonable requests. This
is not sensible if the customer has no background in network management,
or has no preferences.
The text tells us on the next page that we should also be concerned with
making the network easier for employees to use. Increasing usability is
not necessarily at odds with increasing manageability, but the text wants
to make sure we know that these two concepts serve two different parts
of your customer's employee population.
Adaptability and affordability
Another design goal that is given a short treatment is adaptability.
It is hard to predict the future, but the main idea is to choose technology
and equipment that will not tie you down to one vendor in the future,
or to one set of proprietary options. Networks change. They grow and they
shrink, they use new protocols and new hardware, but choices that are
compliant with industry standards will give you more ability to adapt
to the next change in the future. The text points out that it is more
common now than a few years ago to promote working remotely and working
from home. This does not mean we have to redo the entire network, but
it does mean we have to think about remote security, about VPN connections,
and about creating or increasing the ability of our network to allow remote
access.
Affordability is simply making the right financial decisions for your
customer. What those decisions are will depend a lot on what their network
needs to do, but there is good advice on pages 51 and 52 about buying things that
work with each other, things that are easy to manage and configure, and
things that have the capacity to handle more traffic and more users than
you currently have.
Tradeoffs
The text offers an interesting idea for a conversation with your client.
Since you are talking to him/her about all the topics in this chapter,
you should also make an effort to explain each topic, and to get the client
to prioritize each of them. This will allow you to consider which side
of an issue to emphasize when making choices about tradeoffs. The text
suggests that you assign a percentage to each concept, with the requirement
that they add up to 100. Consider this as a way to allocate the budget
for the project, or to determine which of the customer's goals are most
important.
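As a concrete, entirely hypothetical illustration of the exercise, a client's weights might look like the following; the only hard requirement from the text is that they add up to 100.

    # Hypothetical client priorities for the design goals in this chapter.
    goals = {
        "scalability": 20,
        "availability": 30,
        "performance": 15,
        "security": 5,
        "manageability": 5,
        "usability": 5,
        "adaptability": 5,
        "affordability": 15,
    }
    assert sum(goals.values()) == 100   # the text's requirement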
The chapter ends with another checklist that summarizes the information
you should gather regarding the topics this chapter presented.
Chapter 3
Characterizing the network infrastructure
The
chapter begins with a quote from Abraham Lincoln which is meant to advise
us that we need to know where we are and where we want to be in order
to make choices about how to reach our goals. That being said, the point
of this chapter is to determine the current state of a network.
The text advises us to create a set of network maps
that show the locations of all major network components, segments, and
their names and addresses. We should compare the maps we can make with
data on network usage and performance to see where the network is
stressed, and where it is working well. The text goes into more detail
about this on page 60.
The author informs us that we could start mapping by creating maps of each location in a large network, but she seems to prefer an expanding map that supports the top-down concept.
- We start at a high level, creating a map that shows a general schematic of sites and WAN links.
- Each location in the high level map is then represented
with its own map, with the next logical level of detail, perhaps at the
level of Metropolitan Area Networks (MANs).
- If the second level of detail was about MANs, then the next level of detail should show the LANs in each MAN.
- If we have just shown the LANs, we then break down each LAN, showing its components and structure.
- In
sufficiently large or complex LANs, we may want maps of each floor of
each building. This would be helpful for staff who are installing or
moving devices.
The
text describes making maps of services on the network, and lists
several common network services on page 61. It is a good idea to be
aware of the services that are on a network, and those that the project
requires that we add to the network. Making a map of such services may
not be the most useful way to track them, because users tend to move
about with laptops, tablets, and other portable devices. They require
the services they need wherever they might be, so a map may not capture
the fact that the services are needed everywhere.
Skipping ahead to an example, the author shows us high level
diagrams of enterprise networks on pages 63 and 64. You can draw charts like
these with a service available through your email account. Sign on to your Baker email account, then click the Google Apps icon. Click More at the bottom of the list, and select the orange icon for Lucidchart Diagrams. You can make many kinds of charts, including network diagrams with Cisco symbols.
Back
to the text, the diagram on page 63 is a schematic that shows the
network connections to several locations from a central office in
Grants Pass, Oregon. It may be useful to look at these locations on a
map so we can recognize that the diagram is not drawn geographically.
It is drawn to show the equipment being used, and the connections to
each of the central and distant locations. In a set of top-down
diagrams, each of these locations would have another diagram of its
own, which would show the structure and services at that site.
The diagram on page 64 is a little harder to read. Each block in the diagram represents a function or service. Inside the block we see icons for the kind
of equipment providing the service, but we are not seeing how many
devices might actually be installed. We also see connecting lines
representing communication media, but none of the lines show bandwidth
details that were shown in the diagram on page 63.
The next several pages present more topics you should document about the network or plan in question.
- Addressing and naming
- The text makes some suggestions about naming standards to reflect the
location, device type, and/or service the device provides. IP
addressing is almost universal, so a logical addressing scheme and
method should be chosen that will allow scaling and subdividing as
needed. The author promises more about this in a later chapter.
- Wiring and network media
- The type and grade of cable used inside and between buildings should
be documented. It may be helpful to know the terms listed in the text: vertical wiring runs from one floor to another, horizontal wiring runs from a wiring closet to a wallplate (which may be in the floor, or under it), and work-area wiring
runs from a wall plate to a host you are connecting to a network. The
author makes an odd observation about most wiring being assumed to be
less than 100 meters long. The general rule about Unshielded Twisted
Pair wiring is that it doesn't work if the total run from a host to a
network connectivity device is over 100 meters.
- Architectural and Environmental
constraints - It may not be possible to run network cable through an
area that your customer does not own. It is also possible that you
cannot run a cable if the site is protected by local laws, such as
being a historical site. These issues may lead to considering wireless
solutions for part of your network. The text also addresses supportive
services that your new or expanded network will require, such as air
conditioning, heating, ventilation, power, protection from EMI, and a
secure space for all the equipment.
- Wireless concerns -
A wireless network solves some problems but adds concerns that a wired
network does not have. The author summarizes several of them on pages
69 and 70. Note that a wireless signal will fade
over distance, as will a signal in a wire, but at a much faster rate.
If the signal encounters anything, the signal can be affected in
several different ways.
Checking the health of the existing internetwork
The purpose of this section is to take baseline measurements of the existing
network, so that you can tell whether your changes to the network introduce
improvements or problems. The text cautions us that we must also make
sure that improved performance is one of the customer's goals. If the
main objective is to reduce costs, a performance hit may be acceptable
if it is not too large.
The text offers several pages of ideas for measuring the state of the
network, but not a lot of detail on how to do most of it. We can get a
lot of the ideas by going through the Network Health Checklist on pages
83 and 84. Go over that list and we will discuss it on this week's discussion
board.