CIS 331: Database Management Using SQL

Concepts of Database Management, Chapter 5:
Database Design 1: Normalization

Objectives:
 

This chapter introduces the student to several formal database concepts. The objectives important to this chapter are:

  1. Functional dependence
  2. Primary keys
  3. First, second, and third normal form
  4. Problems avoided with normalization
  5. Problems created by failure to normalize
  6. Fourth normal form
  7. Problems with not having fourth normal form
  8. How to use normalization when designing
Concepts:

Normalization is the process of making your data tables comply with several rules. Compliance with each successive rule takes the tables to the next level of normalization. The levels are called normal forms, specifically first, second, third, and fourth normal form. The forms can also be referred to by the symbols 1NF, 2NF, 3NF, and 4NF.

The first objective of the chapter is to define functional dependence. A column is functionally dependent on another column if the value found in the second column determines the value found in the first column. Suppose we had a database about customers who live in either Canada or the United States. Let's have a column in the customer table called State/Province, and another column called Country. If the value in State/Province for a customer is Michigan, the value in Country for that customer will be United States. If the value in State/Province is Ontario, the value in Country will be Canada. So the Country column is functionally dependent on State/Province. Likewise, the State/Province column functionally determines Country. Note that the relationship does not have to flow both ways: knowing a customer's country of residence will not determine their state of residence.

In small tables, it is tempting to find dependencies that may not be real. The text suggests that since we have only three sales people in the Rep table, we could conclude that RepNum determines LastName, and LastName determines RepNum. The key is to ask about the future, and to imagine what can and may happen. Hiring a new sales person with the same last name as a current sales rep shows that RepNum will determine LastName, but LastName will not determine any other column.

RepNum, in fact, determines every other field in the Rep table, since each sales person has a unique number. This makes RepNum a key for the Rep table. It is in fact the primary key for this table.

The rule for primary keys states that a column is the primary key if two things are true:

  • All columns are functionally dependent on the primary key
  • No subset of columns in the primary key functionally determines all other columns

This means that a primary key must determine all other columns. It also means that if the primary key consists of more than one column, no portion of the primary key may act as a primary key. If the subset portion of the rule is violated, it means that you have not identified the correct primary key.

Candidate keys are those columns that could be chosen as the primary key. It is possible that more than one candidate key exists in a table, for instance, the book suggests that a table that holds both employee ID numbers and Social Security numbers could use either column as a primary key. Any candidate key that is not chosen to be the primary key is designated an alternate key.

The definition of first normal form says that a table may not contain repeating groups. This was included in earlier chapters as the statement that cells in a table may only contain single values. A standard cure for repeating groups split the records with repeating groups into multiple records.

It may also be necessary to expand the primary key. The second problem with repeating groups is that the column representing the primary key does not determine the value of the field containing the repeating groups. This becomes obvious when you expand such a row, making it multiple rows containing the same value for the primary key. Assume the following table:

Instructor Class Room
Vincent CIS107a
CIS331
S107
S102
Scott CIS302a
CIS331
D204
D204

The Class and Room columns contain repeating values. The Instructor column is meant to be the primary key, but it does not meet the rule. We can fix this by expanding the primary key, adding a column for the second part of the new primary key, and splitting the values of the repeating groups into separate rows.

Instructor Day Class Room
Vincent Wednesday CIS107a S107
Vincent Thursday CIS331 S102
Scott Tuesday CIS302a D204
Scott Wednesday CIS331
D204

In the expanded table, the new primary key is the combination of the Instructor and Day columns. The combination determines the contents of the remaining columns. This table is now in first normal form.

Second normal form says that a table is in first normal form, and it has a new quality: No column that is not part of the primary key is determined by a subset of the primary key. Let's say that another way: if the primary key is made of more than one column, no column in the table can be determined by a mere portion of primary key. In the example above, the rule requires that Class and Room cannot be determined by Instructor or Day alone. This rule is met, so this table is in second normal form.

Note that since this rule describes tables whose primary key is a combination of columns, any table in first normal form that has a single column as its primary key is automatically in second normal form.

So, how do you find and fix tables that are not in 2NF? You examine tables for dependencies. Which columns are determined by which other columns? How much data in the table is repeated from one row to another? You do the same thing when imposing Third normal for, so let's think about it for a minute.

Third normal form lets us use another term from the text. A determinant is a column (or set of columns) that determines the contents of another column. A table is in third normal form if the only determinants in it are candidate keys. (Remember that the primary key is also a candidate key.) The procedure to cure a violation of this rule is simple: we copy the non-candidate determinant into a new table and move the columns that it determines to the new table. The non-candidate determinant becomes the primary key of the new table. It continues to exist in the old table, relating the two tables.

Problems are created when the correction procedures are not done correctly. Students should examine the examples in the text. Note that when the solution is to decompose a table into two or more tables, each resultant table must be in 3NF when you are done.

Consider this example from About.com:

CustNum FirstName LastName Address City State ZIP
1 John Doe 12 Main Street Sea Cliff NY 11579
2 Alan Johnson 82 Evergreen Tr Sea Cliff NY 11579
3 Beth Thompson 1912 NE 1st St Miami FL 33157
4 Jacob Smith 142 Irish Way South Bend IN 46637
5 Sue Ryan 412 NE 1st St Miami FL 33157

Like the example in the text, there is repeated data that could be a problem when updating the database. Assume that we have lots of customers in each of a dozen cities. There is the potential for entering an error each time the same data element is entered. We can simplify data entry by removing the columns that are dependent on a single nonkey column and creating a new table. Since ZIP determines City and State, we remove the City and State columns from the table, and create a new table containing ZIP, City, and State.

ZIP City State
11579 Sea Cliff NY
33157 Miami FL
46637 South Bend IN

This allows us to simplify the original table:

CustNum FirstName LastName Address ZIP
1 John Doe 12 Main Street 11579
2 Alan Johnson 82 Evergreen Tr 11579
3 Beth Thompson 1912 NE 1st St 33157
4 Jacob Smith 142 Irish Way 46637
5 Sue Ryan 412 NE 1st St 33157

The two resultant tables are related on the ZIP column. (Using color theory, I have related the blue table above to the yellow table below on the green column.)

When we place our data tables in 3NF, we avoid some classic problems found in tables that are not in first, second, or third normal form:

  • When data from a table like Sales Reps is represented in other tables by the Sales Rep number, we only have to update other attributes of a Sales Rep in one place, as opposed to many places. This simplifies updating problems.
  • Keeping all data about an entity in one short descriptive table and referencing that table by its key attribute in other tables also avoids inconsistent storage of the data, which is possible when the same value must be entered in many places.
  • Assume we did not have a separate table for sales reps, but stored the Sales Rep data in the Customer table. We could not add a new sales rep to the database without assigning that sales rep to one or more customers. If all the customers of an existing employee were reassigned, we would lose all our data about that sales person. In this respect, having separate tables for customers and sales reps allows us to add and delete from either, and allows for the possibilities of a customer without a rep and a rep without any customers.

The discussion of fourth normal form requires another term. If each value of column A has several associated values of column B, then B is multidependent on A. This is also called a multivalued dependency. Fourth normal form requires that we have third normal form, and no multivalued dependencies.

To avoid the multidependency problem, the text shows us a new method for attaining 1NF. In the first examples, we removed repeating groups and expanded the primary key. That will not work if we have multiple repeating groups. In the text example, we have a Faculty table whose primary key is FacultyNum. It includes FacultyName, and two repeating groups: StudentNum, StudentName, and CommitteeCode, CommitteeName. The idea is that each faculty member will belong to several committees, and will be the advisor for several students. In addition, a student may have several advisors due to multiple majors.

The method to resolve the problems of such a table is to spin off several tables. Solution: remove each repeating group to a new separate table, copy the primary key of the original table to the new table, and establish a new primary key for that table consisting of the copied key column and the key for the repeating group.

This results in simpler tables:
Faculty (FacultyNum, FacultyName)
FacStudent (FacultyNum, StudentNum, StudentName)
FacCommittee (FacultyNum, CommitteeCode, CommitteeName)