CIS 361 Data Communications and Networks

Chapter 7: Data Coding


Objectives:

This short chapter introduces concepts about code systems used to transmit data. The objectives important to this chapter are:

  • explain what a code is
  • understand several different code systems
  • describe the function of control characters
  • describe methods of error checking and compression

Concepts:

The most basic idea in this chapter is that computers are binary devices, essentially collections of switches that may be off or on. A bit is a binary digit, and it can represent only those two states.

To communicate more complicated information than off and on, we need codes, sets of symbols that can be transmitted across data lines. The evolution of data codes is discussed, starting on page 250. Samuel F. B. Morse created Morse Code, which was useful for telegraphs and human operators, but is not very good for machine transmission.

Computer systems do better when the code used meets the four criteria listed on page 251:

  • True binary code - Morse is not, since it relies on long and short pulses

  • All characters use the same number of bits - Morse does not
  • All bits are perfectly formed - there is too much human error in telegraphy
  • All bits are the same duration - this makes it easier for the machines to know when to listen

The number of bits in a symbol, assuming the bits are just on and off, determines how many unique symbols your code system can represent. The chart on page 251 summarizes this: if a bit has two possible states, then two raised to the power of the number of bits is the number of permutations (unique possibilities) possible in that system. Your book refers to these unique combinations as code points (see the short sketch after the list below). At this point, some students will observe that the English alphabet has only twenty-six letters, so five bits might be enough. It would be, if we did not care about numerals, punctuation, capital and lowercase letters, and the concepts on page 252. Three types of character assignments, meanings for code points, are listed:

  • alphanumeric characters are the actual printed characters of a language
  • format effectors are characters that change how text looks, such as spaces, line returns, tabs and others.
  • control characters are commands to a device to do something, such as change color or start a new page, or markers that have to do with processing signals (like "start of text" or "end of transmission")
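
If you want to see the code point arithmetic run, here is a quick Python sketch (my own illustration, not from the book):

    # With two states per bit, n bits give 2**n unique code points.
    for bits in (1, 4, 5, 7, 8, 16):
        print(f"{bits:2d} bits -> {2 ** bits:6d} code points")

Five bits yield 32 code points, enough for the letters alone; seven bits yield the 128 of ASCII; sixteen yield the 65,536 of Unicode, both discussed below.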

In addition, bits can be added to symbols to help check for errors. Parity bits are an example. Assume that a code system uses seven bits for each symbol. Every unique combination will have either an even or odd number of bits set to 1. So we can attach a parity bit that tells the receiver whether each symbol is supposed to have an even or odd number of 1s in it. In even parity systems, if the number of 1s in the first seven bits is already even, we add a 0 as the parity bit; if the number of 1s is odd, we add a 1 (to make the count even). Odd parity systems are the reverse: you add a 1 if the count of 1s is even (to make it odd), and a 0 if it is already odd. The point of the extra bit is to help the receiver decide whether the symbol it sees is the symbol that was actually transmitted.
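
Here is the parity rule as a few lines of Python, again my own sketch rather than anything from the book:

    def parity_bit(symbol: int, even: bool = True) -> int:
        # Count the 1s in the low seven bits of the symbol.
        ones = bin(symbol & 0x7F).count("1")
        if even:
            return 0 if ones % 2 == 0 else 1   # even parity: make the total even
        return 1 if ones % 2 == 0 else 0       # odd parity: make the total odd

    # "A" is 1000001 in seven bits: two 1s, already an even count.
    print(parity_bit(0b1000001))              # -> 0 (even parity)
    print(parity_bit(0b1000001, even=False))  # -> 1 (odd parity)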

On page 253, the idea of Escape codes is introduced. This is compared to the idea of the Shift key on a typewriter. In the same way that a Shift key changes the meaning of the next symbol, an Escape code changes the meaning of symbols in a code system. This effectively gives two meanings to most symbols, without having to double the size of the code table. The drawback is that the software reading the code has to watch out for such characters.
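
A sketch of what that watching looks like, using a hypothetical three-symbol code table I made up (the escape value is borrowed from ASCII's ESC character):

    ESC = 0x1B  # the escape code point; it prints nothing itself

    NORMAL    = {0x01: "a", 0x02: "b", 0x03: "c"}
    ALTERNATE = {0x01: "1", 0x02: "2", 0x03: "3"}

    def decode(stream):
        out, escaped = [], False
        for code in stream:
            if code == ESC:
                escaped = True      # change the meaning of the next symbol
                continue
            out.append((ALTERNATE if escaped else NORMAL).get(code, "?"))
            escaped = False
        return "".join(out)

    print(decode([0x01, ESC, 0x01, 0x02]))  # -> "a1b"

Three code points carry six printable meanings, at the cost of the decoder having to track state.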

Specific codes are discussed, beginning on page 254, the first being the real Baudot code, invented by Émile Baudot, and the second being the code called Baudot, invented by Donald Murray. The book apologizes, then proceeds to discuss Murray's code, using the common misnomer for it, Baudot. I will refer to Murray's International Telegraph Alphabet No. 2 as Baudot here.

Baudot code, as generally used, is a five bit code, with no parity. It is still used in teletypes, telegraphs and telex equipment, even though it is outmoded.

ASCII is a major code system. It is the American Standard Code for Information Interchange, developed by the American National Standards Institute, a U.S. standards organization. It works very well for English, but has some drawbacks. ASCII is a seven bit code, which becomes eight bits if a parity bit is used or if the extended version is used. Seven bits give us 128 characters, illustrated on page 256. Using the chart on this page, find the capital "A". The seven bits for this symbol are 100 (from the column heading) and 0001 (from the row heading). This is the binary equivalent of the decimal number 65. Most common English letters and symbols are represented here, but not all. A second chart of 128 more characters forms the extended ASCII table, used in eight bit ASCII. Notice that the bits in this chart are numbered in descending order from left to right. For the letter "A", bit 7 is a 1, bit 6 is a 0, and bit 5 is a 0. (The "100" noted above.) When using extended ASCII, every symbol has eight bits, and every symbol in the chart on page 256 has a 0 as its eighth bit.
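
You can check the "A" example yourself in Python, whose ord function returns a character's ASCII/Unicode code point:

    print(ord("A"))                 # -> 65
    print(format(ord("A"), "07b"))  # -> 1000001 (bits 7 down to 1)
    print(format(ord("A"), "08b"))  # -> 01000001 (extended ASCII adds a leading 0)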

Another major code is shown on page 257. It was invented by IBM for its mainframe systems, and has the horrible name EBCDIC, which stands for Extended Binary Coded Decimal Interchange Code. (Can you tell that the IBM marketing department was not its best department?) To be incredibly different, EBCDIC is an eight bit code and the bits are numbered in the reverse order from ASCII numbering. This system is not used on any personal computer, only on mainframes.

Unicode is discussed on page 256. This is the biggest one yet. Unicode was created by a consortium of computer and software companies, and is a sixteen bit system. This gives us 65,536 possible symbols. Why so many? Earth is a large planet in terms of languages. 128 or 256 symbols is nothing compared to the number needed for Roman alphabets, Cyrillic alphabets, Asian alphabets, symbolic languages that do not even use true alphabets, and so on. Sixteen bits is about enough to cover the needs of a computer system capable of communicating with anyone on the planet.
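
Python strings are built on Unicode code points, so it is easy to poke at symbols well beyond the 128 or 256 of the earlier systems:

    for ch in ("A", "Я", "中"):      # Latin, Cyrillic, Chinese
        print(ch, ord(ch), hex(ord(ch)))
    print(2 ** 16)                   # -> 65536 possible 16-bit symbols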

On page 259, the author discusses Control Characters, and lists several common ones. The names of these characters are often acronyms, such as SOH for Start of Header, and EOT for End of Transmission (End of Text is a different character, ETX). Some are just abbreviations, like ACK for Acknowledgment, and NAK for Negative Acknowledgment.
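
For reference, here are the standard ASCII values of the characters just named, written as Python constants:

    SOH = 0x01  # Start of Header (heading)
    EOT = 0x04  # End of Transmission
    ACK = 0x06  # Acknowledgment
    NAK = 0x15  # Negative Acknowledgment

    # A receiver signalling "got it" sends a one-byte ACK:
    print(bytes([ACK]))  # -> b'\x06'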

The concept of code efficiency is discussed briefly on page 260. This strikes me as a bit of an accountant's ruse. The author explains that an efficient code is one that spends most of its characters passing information, not processing overhead like error trapping. As far as it goes, he is correct, but I wonder how "efficient" a code might be considered if we have to keep retransmitting messages over and over because it has no error traps?
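
One common way to put a number on this (my formulation; the book may compute it differently) is the ratio of information bits to total bits sent:

    def efficiency(data_bits: int, overhead_bits: int) -> float:
        return data_bits / (data_bits + overhead_bits)

    print(efficiency(7, 1))  # -> 0.875: seven data bits plus one parity bit
    print(efficiency(8, 0))  # -> 1.0: "perfectly efficient", but unchecked

Which rather makes my point: the second code scores higher while giving the receiver no way to catch a damaged symbol.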

When passing information from one computer system to another, one or both must often convert the messages from one code system to another. For instance, sending a message from an EBCDIC terminal to an ASCII PC requires at least one conversion. In ongoing transactions, it is likely that many conversions must take place. It is part of the job of network software to work out which side will translate, one or both, and what common languages each side speaks.
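
Python's codec library can demonstrate such a conversion. Note that cp037 is only one of IBM's EBCDIC code pages (the US English one); the right page depends on the mainframe involved:

    text = "HELLO"
    ebcdic = text.encode("cp037")   # text -> EBCDIC bytes
    print(ebcdic.hex())             # -> c8c5d3d3d6, nothing like the ASCII values
    print(ebcdic.decode("cp037"))   # -> HELLO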

Data Compression (or Compaction) is the next topic. Three main schemes for sending fewer bits across wires are listed on page 262:

  • Character Compression - the software detects commonly sent characters, and decides to send short representations of them. This must be agreed upon by sender and receiver to work.
  • Run Length Encoding - common repeated characters or phrases can be represented by short code groups. Again, both sides must agree on the definitions of these short code groups (see the sketch after this list).
  • Character Stripping - control characters are removed at the sending end, and replaced at the receiving end.
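
Here is the run length encoding sketch promised above; the (count, character) representation is my own choice, and sender and receiver would both have to agree to it:

    from itertools import groupby

    def rle_encode(text):
        return [(len(list(run)), ch) for ch, run in groupby(text)]

    def rle_decode(pairs):
        return "".join(ch * count for count, ch in pairs)

    encoded = rle_encode("AAAAABBBCA")
    print(encoded)              # -> [(5, 'A'), (3, 'B'), (1, 'C'), (1, 'A')]
    print(rle_decode(encoded))  # -> AAAAABBBCA
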
The last major topic of the chapter is encryption. So far, we have assumed that the sender, the receiver, and the rest of the world all understand the code system being used. Sometimes, however, privacy and secrecy require that a transmission not be understood by intermediate transmitters of the signal. This applies to intended and unintended listeners. Take a web-based credit card purchase, for example. You may not care too much if an eavesdropper knows you are shopping, but you certainly don't want them to have your credit information. Several encryption methods are mentioned, but this is a rapidly changing topic, and I would like to suggest it as a good one for project research.
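
To make the idea concrete without wading into methods that will be dated by the time you read this, here is a toy XOR cipher, far too weak for real credit card traffic but enough to show that a listener without the key sees gibberish:

    from itertools import cycle

    def xor_cipher(data: bytes, key: bytes) -> bytes:
        # XOR is its own inverse: apply the same key to decrypt.
        return bytes(b ^ k for b, k in zip(data, cycle(key)))

    secret = xor_cipher(b"card number 1234", b"key")
    print(secret)                       # scrambled bytes on the wire
    print(xor_cipher(secret, b"key"))   # -> b'card number 1234'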