Chapter nine begins with a discussion on page 348 about data structures, and making a choice between file-oriented systems and relational database systems.
The text offers an example of a file-oriented system: a credit company might keep an ongoing file of transactions completed by its customers, perhaps storing a separate file for each day. There could be a batch process that runs "after hours" that aggregates the transactions for each account, and updates a customer file with the total change for each account. This system could work if there is such a thing as "after hours", which is unlikely in a world that offers goods for sale on the Internet. It would also require that we would be satisfied with updating the customers' records at the end of a chosen time period instead of completing live updates as the transactions are made.
The text proposes that this older system approach could support a company that chose to or needed to keep the records of each division of the company separate from the others. This compartmentalization presents a problem: redundant data elements probably exist, in that each division would need to keep records on every customer. This is wasteful in terms of data storage needs and costs. It also leads to data integrity problems: an update to a customer's record (change in name, address, balance, etc.) would need to be made by each division if separate file stores exist. It is quite likely that the data will be out of date or otherwise incorrect unless each division is notified when any division becomes aware of a request to change stored data. This invites inconsistent data into our system. The text is also critical of the fact that each data file is likely to be designed differently from the others, which would lead to difficulty in retrieving data from more than one file in a single query.
These are good reasons for the data industry to have moved on to more modern, flexible, and resilient data structures. It makes sense to move on to the kind of system you are more likely to build and support.
On page 350, the text lists several arguments about why database management systems (DBMSs) are more useful that file-oriented systems. This list may be easier to understand:
The text gives us a list of components found in most DBMSs, most of which are familiar:
The text also presents some thoughts about web-based data design. It should be an assumption that any commercial entity will want to have a presence on the Internet, and most non-commercial entities will want one as well.
Let's move ahead to page 356, where the text presents more details about databases. First we enjoy some basic terms.
Normalization is the process of making your data tables comply with several rules. Compliance with each successive rule takes the tables to the next level of normalization. The levels are called normal forms, specifically first, second, third, and fourth normal form. The forms can also be referred to by the symbols 1NF, 2NF, 3NF, and 4NF.
The first objective for understanding normalization should be to define functional dependence.
Suppose we had a database about customers who live in either Canada or the United States. Let's have a column in the customer table called State/Province, and another column called Country. In case you are not aware of it, all the states and provinces have unique names.
So the Country column is functionally dependent on State/Province. Knowing the value in the State/Province column allows us to predict the associated value in the Country column. We can also say that the State/Province column functionally determines Country. Note that the relationship does not have to flow both ways: knowing a customer's country of residence will not determine their state/province of residence. This situation makes State/Province a key field for this table.
In small tables, it is tempting to find dependencies that may not be real. Assume the table above of sales people with three fields: RepNum. LastName, and FirstName. If we have only three sales people in the Rep table, we could conclude that RepNum (a field listing employee numbers) determines LastName, or that LastName determines RepNum. The key is to ask about the future, and to imagine what can and may happen. Hiring a new sales person with the same last name as a current sales rep should tell us that RepNum will determine LastName, but LastName will not determine any other column.
RepNum, in fact, determines every other field in the Rep table, since each sales person has a unique number. This makes RepNum a key for the Rep table. It is in fact the primary key for this table.
The rule for primary keys states that a column is the primary key if two things are true:
This means that a primary key must determine all other columns. It is possible that you need a combination of two columns to define a primary key.
Candidate keys are those columns that could be chosen as the primary key. It is possible that more than one candidate key exists in a table, for instance, a table that holds both employee ID numbers and Social Security numbers could use either column as a primary key. Any candidate key that is not chosen to be the primary key is designated an alternate key.
Okay, with that understood, let's talk about normalization. The definition of first normal form says that a table may not contain repeating groups. This means that cells in a table may only contain single values. A standard cure for repeating groups is to split the records with repeating groups into multiple records. It may also be necessary to expand the primary key (include another column in the key).
The second problem with repeating groups is that the column representing the primary key cannot determine the value of the field containing the repeating groups. This becomes obvious when you expand such a row, making it multiple rows containing the same value for the primary key. Consider the table below:
The Class and Room columns contain repeating values. The Instructor column is meant to be the primary key, but it does not meet the rule. We can fix this by expanding the primary key, adding a column for the second part of the new primary key, and splitting the values of the repeating groups into separate rows.
In the expanded table, the new primary key is the combination of the Instructor and Day columns. The combination determines the contents of the remaining columns. This table is now in first normal form.
Second normal form says that a table is in first normal form, and it has a new quality:
Let's say that another way: if the primary key is made of more than one column, the non-key columns in the table cannot be determined by a mere portion of primary key. In the example above, the rule requires that Class and Room cannot be determined by Instructor or Day alone. This rule is met, so this table is in second normal form.
Note that since this rule describes tables whose primary key is a combination of columns, any table in first normal form that has a single column as its primary key is automatically in second normal form.
So, how do you find and fix tables that are not in 2NF? You examine tables for dependencies. Which columns are determined by which other columns? How much data in the table is repeated from one row to another? You use the same techniques when imposing Third normal form, so let's think about it for a minute.
Third normal form lets us use another term from the text.
The procedure to cure a violation of this rule is simple: we copy the non-candidate determinant into a new table and move the columns that it determines to the new table. The non-candidate determinant becomes the primary key of the new table. It continues to exist in the old table, relating the two tables.
Problems are created when the correction procedures are not done correctly. Students should examine the examples in the text. Note that when the solution is to decompose a table into two or more tables, each resultant table must be in 3NF when you are done.
Consider this example from About.com:
Some of the same address information is saved for multiple customers. That could be a problem when updating the database. Assume that we have lots of customers in each of a dozen cities. There is the potential for entering an error each time the same data element is entered. We can simplify data entry by removing the columns that are dependent on a single nonkey column and creating a new table. Since ZIP determines City and State, we remove the City and State columns from the table, and create a new table containing ZIP, City, and State.
This allows us to simplify the original table:
The two resultant tables are related on the ZIP column. (Using color theory, I have related the blue table above to the yellow table below on the green column.) When we need the City and State information for a customer, we look it up in the ZIP table, where each combination is stored only once.
When we place our data tables in 3NF, we avoid some classic problems found in tables that are not in first, second, or third normal form:
Chapter 10 is like a lesson on networking. The main material begins on page 425. Much of it is outside the scope of this class, despite the authors' efforts to include this material.
The text makes the point that some network planning decisions should be made at high levels. It should not be left to branch offices to decide whether or not to use client server networks, whether to use the same software as the rest of the enterprise, or whether to connect their offices to each other in the same way that has been used throughout the rest of the enterprise. Decisions like these should be made for the entire enterprise, as should decisions on upgrades to such components.
Much of the material in this chapter should have been considered as constraints in your early planning and analysis phases. If your system design did not already account for the network infrastructure in which it must exist, your plan is in trouble.
It may be appropriate to consider whether you will use a thin or thick client design. Thin clients put most of the burden on a server, presenting only an interface on the user's workstation, and performing the real computing in a session on a server. Thick clients (also called fat clients) put much of the processing on the workstation, and mainly save data on a server. Some questions you must ask to make this choice:
The text spends several pages on network topology and protocols. This material is more properly covered in a class about networking.