CIS 2210 - Database Management and Design

Chapter 6: Normalization

Objectives:

This lesson discusses material from chapter 6. Objectives important to this lesson:

  1. What is normalization?
  2. Normal forms
  3. Transforming relations to higher normal forms
  4. Using normalization with ER modeling
  5. Denormalization
Concepts:

Chapter 6

The chapter begins with several bullet points about normalization:

  • definition: The text says it is the process of changing data structures to minimize data redundancy. It is more than that, but controlling redundancy is a primary goal.
  • benefit: Normalization reduces the number of data anomalies by controlling how tables are related.
  • control: Tables are adapted from their initial states to meet specific standards (1NF, 2NF, 3NF, etc.)
  • benefit: Each successive level of normalization is an improvement of the lower numbered forms.
  • economy: Once you achieve 3NF, you may already be in 4NF.
  • reversal: Tables can be denormalized, reduced to a lower level of normalization, to increase performance in the DBMS. This is not intuitive. It is sometimes done because it is easier for the DBMS to handle fewer tables that are not as normalized as they could be.
  • Benefit: Using normalization in the design process improves the table designs, their attributes, and their relationships.

The text presents an overview of a normalization process:

  • Each table is about a single subject (entity).
  • We will reduce the number of places individual data items are stored.
  • All attributes in a table that are not part of the primary key should be dependent on the primary key. If not, the dependent items should be moved to another table, and the column that determines them should be copied to the new table as its primary key.
  • Inserting, updating, and deleting rows from the table will not cause anomalies.
  • All tables are raised to 3NF or higher.
  • The process addresses one table at a time, but considers all tables in a relationship and breaks out attributes into new tables when needed.

Table 6.2 offers a summary of five normal forms, but it uses language that has not been discussed yet. Let's cover some of the necessary terms.

  • The summary for 1NF says the data is in a table, there are no repeating groups, and the Primary Key has been identified. This is accurate, but it leaves out some of the other features of 1NF.
  • The summary for 2NF says that our table is in 1NF, and it has no partial dependencies. This means that if you are using a composite primary key, there is no column in the table that can be determined by a portion of the primary key. If there was such a column, it should be moved to a new table, along with a copy of the part of the primary key that determines the dependent column. Remember that if your primary key is a single attribute, and your table is in 1NF, you can't be in violation of 2NF.
  • The summary for 3NF says the table is in 2NF, and it has no transitive dependencies. This says that you can't have a column in the table that is determined by another column, unless the determinant column is the primary key. (Alternate keys are still okay.) This is a lot like the 2NF problem, but this time we are looking for determinants that are not part of the primary key.
  • The summary for Boyce-Codd normal form says that every determinant in the table is a candidate key. Another way to say it is that we don't have any attributes that depend on other attributes unless those attributes are candidate keys. This was previously taught as the second half of 3NF, so we should be there already if we are paying attention. If our table has this problem, we resolve it by the same method used in the two bullets above.
  • The summary for 4NF is quite murky. We can say that 4NF requires that there be no dependencies on combinations of non-key fields. This is like an expanded version of Boyce-Codd, but it involves composite non-key determinants.

More terminology that is related to dependence:

  • functional dependence - an attribute is functionally dependentt on another attribute (its determinant) if each value in the determinant has a 1:1 correspondence with a specific value in the dependent attribute; if we know a value in the determinant, and we know a corresponding value in the dependent, we can be sure that additional rows having the same value in the determinant will have the same corresponding value in the dependent
  • fully functional dependence - this is the same condition as functionnal dependence, except that the dependent is determined by a composite of multiple attributes
  • transitive dependence - as noted above, a transitive dependencce is when an attribute is dependent on a determinant that is not a primary key or a candidate key

The text offers some examples of normalizing tables, most of which make sense. We will talk about them in class and in discussion forums.

I would like to observe that dependence is a tricky thing. There are really two meanings.

  • Some fields are obviously dependent on other fields. The value of a field can predict what value must occur in another field. This is one meaning of dependency and determinance.
  • Other fields are dependent on a primary key only because they are part of the record that a specific value in the primary key identifies. There is nothing about my ID number that predicts my name, my address, or any other attribute of mine. The value in my address field is determined by the fact that it is mine and it belongs in my record. Dependency and determinance can simply mean that a data value is in its proper record. This meaning is less dramatic than the first meaning, but it is the more common meaning for data in relations: what are the values that are part of the record we are talking about.

Denormalization can be considered as leaving one form of normalization to use a simpler form. You may have noticed (if not, notice it now) that moving to each higher level of normalization may mean adding more tables to the database. Queries that use those tables will have to open each table that is needed for each query, slowing the processes that are needed to complete it. In order to speed up a system, you may want to consider skipping a level or two of normalization. In table 6.6 (or nearby if the book has changed again), we see some examples of "problems" that happen when a table is not normalized as much as possible. The table provides reasons for choosing the denormalized version, and suggestions to handle the less normalized version.

The chapter ends with a check list of advice about conducting and completing the data modeling portion of a project:

  • business rules - document and verify all; make sure they are clear and simple; make sure all rules are valid and current
  • naming conventions for entities, attributes, and relationships
  • entity concerns
  • attribute concerns
  • relationship concerns
  • ER model concerns