CIS 2210 - Database Management and Design

Chapter 2: Data Models
Chapter 3: The Relational Database Model

Objectives:

This lesson discusses material from chapters 2 and 3. Objectives important to this lesson:

  1. Data modeling
  2. Modeling tools
  3. Business rules
  4. Evolution of data models
  5. Newer models
  6. Classifying models by level of abstraction

  7. Logical views in relational databases
  8. Relations: components and connections
  9. Redundancy in relational databases
  10. Indexing
Concepts:

Chapter 2

The chapter begins with some ideas about models. The first ones to consider are three general definitions:

  • model - an abstraction of a real object or event; a model is a simplified way of considering the "important" parts of an object
  • data models - models of real data structures; models of the real information that a system uses
  • data modeling - creating and refining data models that are part of a system being designed or reimagined

Models are used as planning tools, and as tools to confirm the understanding we have about the business whose data our system is supposed to model. We confirm our understanding by discussing it with subject matter experts, usually the people who actually capture or use the data.

The text reviews some of the database terminology we talked about last week, and adds some thoughts about relationships between entities. Relationships are characterized three ways:

  • One-to-many (1:M) means that each instance of one entity may be related to several instances of another entity. For example, in a vendor's database, each customer (an instance of an entity we might call Buyers) would be related to many orders (instances of an entity we might call Requests). However, each of those orders is related to one and only one customer.
  • Many-to-Many (M:N, or M:M) is easier to think about with an example first. Consider a database for a college. In it, there could be a Student table and a Classes table. In the current semester, each instance of a Student could be enrolled in many classes, and each instance of a Class could have many students enrolled in it. Formally, we can say that each instance in one table may be linked to many instances in the other table. It works like that in both directions.
  • One-to-One (1:1) means that each instance in one table is linked to only one instance in the other linked table. Again, this works in both directions: there is a relationship between pairs of instances. For example, an instance of a person may be linked to one specific ID number of a particular type. (I wrote it that way because a person may have more than one kind of ID, but they should never have more than one passport number, driver license number, employee ID number, etc.)

Knowing the kind of relationship that instances (of one entity) have with instances (of another entity) can lead you to apply rules about whether a field in a database must be filled in, or is allowed to be blank or empty. When it must be filled in, a rule is needed that specifies what kind of data must go in that field. This is one kind of constraint, a requirement that helps keep the database in order.

Constraints often are created to accommodate business rules. The text offers some advice about what you should look for when you write business rules.

  • Make them them brief, precise, and unambiguous.
  • Make one for each practice or interaction that the company performs.
  • Make sure to describe the data that will be used when the rule is followed.

The text offers a list of people in a company who might be good sources of information about business rules. We should observe that such information must be confirmed until a consensus is reached about what the rule really is, as opposed to what it might have been meant to be.

On the same subject, the text offers some reasons for creating a set of business rules for the client. It may be necessary to point out to some clients that we need the business rules to create a system that supports them. We may also find ourselves in a situation in which there is no single rule about a number of business practices. If this is the case, we have an opportunity for the client to establish common practices in their organization.

Note the guidelines about building business rules into our database system:

  • nouns in rules will represent entities or instances of entities
  • verbs represent relationships that you will need to use to connect entities
  • make sure you understand relationships from both directions
  • name entities with names that are apt, and that the client understands
  • name attributes according to the same ideas, using terms that are used in the business; do not use names for entities and attributes that are actual commands in the database language you will be using

The text tosses in four concepts that relate to all databases:

  • schema - this word can mean the architecture of a database, including information about each object in it; it can also mean a document that contains all of this information
  • subschema - a view of the part of the database that is actually used by a particular application; there may be a subschema for each application that uses the database
  • data manipulation language (DML) - the actual command language of the database, typically a version of SQL (Structured Query Language)
  • schema data definition language (DDL) - the language in which schema descriptions are written, typically XML (Extended Markup Language)

The text changes direction and begins a section on the database type you are most likely to encounter, the Relational Database.

A relational database can be considered as a collection of data stored in tables. The tables, which must meet some strict rules, are called relations, hence the name relational database. The information about each entity in this kind of database is stored in a table dedicated to that entity. The attributes of that entity are represented as columns in the table, and the rows represent unique instances (cases) of the entity. Rows are also called tuples, but only by people being terribly pedantic. In other words, in a table about students (an entity), each column represents a characteristic common to students (an attribute), and each row represents an individual student (an instance of the entity).

Student Number Last Name First Name Grade
1100123 Jones Shane B
1100136 Smith Kim A

Rows can also be called records or tuples. (No one likes that last word.) Columns represent characteristics, also called attributes or fields.

Relationships between entities are created by the same columns appearing, logically, in two or more tables. You can't just copy a column from one table to another, it has to belong in that table for a reason. I will explain which column it should be in a few more lines. For the moment, trust me about this example:

Our student table could be linked to a table of majors by having a very significant column from the Majors table in each table. In the majors table, each major would appear once. In the student table, each major might appear any number of times, because many students have the same major. This puts majors in a one-to-many relationship to students, regardless of the fact that our student table has only two students in it at this time.

StudentNum LastName FirstName Grade Major
1100123 Jones Shane B 1
1100136 Smith Kim A 3

Major Description
1 Accounting
2 Computer Science
3 Nursing

In the example above, the two relations are linked by their shared attribute: Major. Note that there are three majors shown, and two students. There is no requirement that every major in the Majors table has to appear in the Student table, only that we could use the entry in the Student table to learn details about the major that a student happens to have, once we look it up in the Majors table. The DBMS can be told to do that in a query.

You should know the formal definition of a relation, which has several parts. A relation is a two dimensional table that meets these rules:

  1. Each entry in the table contains one and only one value. No cell may contain more than one value, but some may be allowed to be blank. A value may contain multiple words, like a street address, but it cannot contain two addresses. That would require two address fields/columns/attributes which should have different names (see below).
  2. Each column represents one attribute of the entity, and must have a unique name. Each column name is used only once in any table.
  3. All values in a column are values of the attribute that column is about.
  4. The order of columns in the table is not important.
  5. Each row is unique in some way, often accomplished by an ID number for the row or the case in the row.
  6. The order of the rows is not important.

The rules taken together are the definition of a table being in First Normal Form (1NF), which makes it possible to use that table in a relational database. The statements about row and column order being unimportant do not mean that you can't sort the table. They mean that sorting the table does not affect the data stored in the table. If you sort a table, you sort entire rows, so the information in any row stays with the rest of the information belonging to that row. Think of it like this: a row is a record. You want each record to stay intact when the table is sorted. Would you want information from someone else's record swapped with your record? (No, you wouldn't.)

If a relation does not meet the first rule, that relation is unnormalized. There are more rules about normalization, but this one has to be met before the others become important.

Changing topics again, the text spends some pages on several database models. The next one is the Entity Relationship Model. In another class, you might have encountered Entity Relationship Diagrams (ERDs), which are the main attraction of this model. The text shows us a notation chart that really only shows how to denote whether entities are in One-to-One, One-to-Many, or Many-to-Many relationships in three standard notations. It would be best to know that they exist and that you may need to switch from one to another when you work for different companies. You can use ERDs regardless of the kind of database you are using.

The next model is the Object-Oriented Data Model (OODM), also called the Semantic Data Model. In this model, objects contain data, attributes, and information about relations to other objects. They also contain Information about operations (processing) that takes place on the data they contain. This model is based on object oriented programming. In that kind of programming, you have classes and subclasses.

  • Classes are templates for objects.
  • They can be used as they are, or they can be modified to have more attributes or methods (processes) than the template contains. This is how you create a subclass: it has more stuff in it than its parent class.
  • Subclasses inherit all the methods and attributes of their parent class, as well as having their own methods and attributes.
  • This works the same way in OODM for databases. The text suggest that Unified Modeling Language (UML) should be used to document this kind of model. In the example in the text, we see that the notation for UML is very similar to the ER Model notation. The UML notation has a bit more information in it, and may be easier to read.

The text also mentions the Extended Relational Data Model (ERDM), but not much is said about it. More is said about Extensible Markup Language (XML), which has already been mentioned, and whose acronym does not make much sense.

Several newer technologies are mentioned as responses to the concept called Big Data, which is a catch-all concept for projects that deal with large amounts of varied, unstructured data. The technologies mentioned are Hadoop, Hadoop Distributed File System (HDFS), MapReduce, and NoSQL. The first two are based on a search engine, the third based on a combination of a search function and a summary function, and the fourth is based on the idea that some of your data may be stored in relations, and some may not be stored in relations. The text says a bit more about this concept.

NoSQL may stand for non-SQL or for Not Only SQL. It supports relational data, and also data that is not stored in relations. Because it does not require that data be stored in normalized tables, it can be used with collections of data in many other formats, which describes much of the data on the Internet. The text offers an example of a table that is meant to be a relation, but it only has some of its cells filled in. When stored in a NoSQL key-value table, that table only holds the information that is known about the instances in the original table, and does not hold any cells that would have been blank in the intended relation. You may see how this would make a search faster, but it would also leave out the fact that some of the data elements are missing.

The text spends several pages summarizing the features of the models it has discussed, offering a chart that describes their conceptual natures.

  • Hierarchical and Network models - did not represent M:N relations well; structure determined its possible reports; did not allow ad hoc queries
  • Relational - logical associations of data; allowed SQL queries; allowed restrictions on user view of data
  • Entity Relationship - added semantic ideas; good for modeling, but no implementation as a DBMS
  • Semantic, Object-Oriented, and Extended Relational - added more semantics; supported complex data, classes, and inheritance; allows XML which allowed sharing data with other systems in that format
  • Big Data - reduced semantics; allowed non-relational data, and data having gaps

In the list above, we can also say that models at the top of the list have little data abstraction, and models at at the bottom of the list have high data abstraction. Abstraction is defined in this chapter as being "only a model", representing a real data structure or relationship but only in a way that can be understood or summarized easily.

Chapter 3

Having given you an overview of database methods and trends last week in chapter 1, chapter 3 returns to relational databases, which are the main kind you will deal with in the real world. The chapter presents a list of features that (mostly) define a relation:

  1. A relation is presented as a table having rows and columns.
  2. A row in a relation represents an instance of an entity that the table holds information about.
  3. Each column in the relation has a unique (in the context of that table) name, and it represents one attribute of the entity.
  4. Each intersection of a row and a column is a cell, which may hold only one appropriate value for the instance and attribute it represents.
  5. Values in any column must be formatted in the same way as all other values in that column.
  6. The range of allowed values in a particular column is called the attribute domain.
  7. The order of the rows and columns in the actual table is not important to the DBMS. (Sorting can be done as needed.)
  8. Each table must have either one attribute or a combination of attributes that can be used to identify each row as being different from all the others. (This is called the key for the table.)

Keys are very important to relations. One of the more formal rules for relations states that rows must be unique in a table. Keys are used to identify them as being unique. An example of a key field would be a field that holds a number that is unique for each row. Key fields are also used to link one relation to another, as noted in chapter 2.

Although a key is required, there is no requirement that there must be only one key in a relation. A relation may have several fields that might be used as that relation's primary key, the field that is chosen as the one that is to be used as its unique identifier. In fact, sometimes it is necessary to use a combination of several fields as the key. In those cases, the key is also called a composite key.

The rule for primary keys states that a column is the primary key if two things are true:

  • All columns are functionally dependent on the primary key. This means that the key can be used to specify which row we are talking about.
  • No subset of columns in the primary key functionally determines all other columns.

This means that a primary key must determine all other columns. The key will take you to a unique set of values among all the rows in that table. It also means that if the primary key consists of more than one column, no portion of the primary key may act as a primary key. If the subset portion of the rule is violated, it means that you have not identified the correct primary key.

Candidate keys are those columns that could be chosen as the primary key. It is possible that more than one candidate key exists in a table. For instance, a table that holds both employee ID numbers and Social Security numbers could use either column as a primary key. Any candidate key that is not chosen to be the primary key is designated an alternate key.

So, important terms from that section:

  • functional dependence - the state in which the value in some field (or an aggregate of fields) can determine the value of one or more other fields
  • determinant - a field (or an aggregate of fields) whose value determines the value of another field or fields in its row
  • dependent - a field whose value is determined by the value in another field
  • key attribute - an attribute that is used as part of a composite key; note that key attributes are not allowed to contain nulls. (A cell that is empty is often said to hold a null, but technically, a null is a placeholder that mean "this space intentionally left blank". In this rule, the cell cannot be empty and cannot hold a null.)
  • entity integrity - the values in a relation's primary key must be unique in that relation; there can be no repeats of a primary key's value in a relation; this state is attained when all values in a relation's primary key are unique, and no portion of the primary key is blank; this makes primary keys functional
  • secondary key - an alternate key that can be used for data retrieval
  • foreign key - the primary key of some other table; a foreign key is typically the information that is used to link one relation to another relation; in the relation in which it is the foreign key (not the one in which it is primary), that field must match a value in the same field in the linked relation; the rules say a foreign key may have a null entry, but that is a bad practice
  • referential integrity - the state in which each value in a foreign key matches some value in the primary key of the related table (unless the value in the foreign key is blank, which is still a bad idea)

Because it is an important issue, most DBMSs will allow you to set constraints (rules) about particular fields. For instance, the NOT NULL constraint requires that a field must have a value in it. This is not the same thing as the UNIQUE constraint, which requires that the value in a field must not duplicate any other value in the same field in that file. However, since null entries are all the same, applying both constraints to a field that is the primary key for a table would make sense.

The chapter is about to get involved in actual SQL commands, so pay attention.

The text introduces the concept of relational algebra, a set of operations performed with commands that examine and manipulate relations. The commands are called operators, but only in formal discussions. Performing a relational operation on one or more relations will produce one or more new relations. These new relations can be saved as files, but they don't have to be. The text refers to the creation of new relations from old ones as giving closure to relational operators. In English, this makes no sense, but we have to accept it as a required definition.

The text introduces some relational operators:

  • select - This command is used to specify which fields to report from a table. Specifying the fields is what the text means by "a horizontal subset". The select command has more options that allow us to specify table names and row filters.
  • project - The text shows us that project may be another name for the select command.
  • union - This command results in a table holding all rows from a list of specific tables, with any duplicate rows discarded. This should only be carried out on tables that have the same number of fields, and whose fields have the same rules for acceptable values. This is called being union compatible.
  • intersect - This command results in a table holding one copy of each of the rows that are common to a specific list of tables. The tables must be union compatible for this command to work.
  • difference - This command, like union and intersect, requires the names of two tables. In this case, it returns the rows from the first named table that are NOT found in the second named table. You should understand that this command would give a different result if the two table names were given in the reverse order. Another way to think of this is that it returns the contents of the first table, minus any intersections with the second table. The tables must be union compatible for this command to work.
  • product - This one is harder to visualize from the book description. Imagine two tables with nothing in common. The first table has 10 rows. The second table has 5. The return value of this command combines each row in the first table with each row in the second table. In this case, the result would have 50 rows: 10 times 5, hence the name of the operator. It displays every possible combination of one row from each table combined as a single new row in the new table. It is unusual to want to do this kind of query.
  • join - a join typically combines information from two or more tables, as you might do when printing customer addresses from a customer table (that holds customer names, street information, and ZIP codes) and a ZIP code table (that holds ZIP codes, cities, and states). This may be the most common kind of operation in a database.

Some examples are given of the results these operators would produce on some example tables. Note that the examples in the text do NOT use correct SQL syntax. They are meant to be read as hybrids of the correct syntax and a user request for the information.

Relational models and normalization

A large part of the assignment for this week requires you to have knowledge about data normalization. The text does not appear to spend any pages on general database design and normalization. Instead of hoping you have seen it before, here are some thoughts about those ideas from another text:

Normalization is the process of making your data tables comply with several rules. Compliance with each successive rule takes the tables to the next level of normalization. The levels are called normal forms, specifically first, second, third, and fourth (also called Boyce-Codd) normal form. The forms can also be referred to by the symbols 1NF, 2NF, 3NF, and 4NF (BCNF).

The first objective is to define functional dependence. A column is functionally dependent on another column if the value found in the second column determines the value found in the first column. Suppose we had a database about customers who live in either Canada or the United States. Let's have a column in the customer table called State/Province, and another column called Country. If the value in State/Province for a customer is Michigan, the value in Country for that customer will be United States. If the value in State/Province is Ontario, the value in Country will be Canada. So the Country column is functionally dependent on State/Province. Likewise, the State/Province column functionally determines Country. Note that such a relationship does not have to flow both ways: knowing a customer's country of residence will not determine their state/province of residence.

Let's consider a problem that database developers can find on their screens. In small tables, it is tempting to find dependencies that may not be real. If we have only three sales people in a Rep table, we could conclude that RepNum (the sales rep's ID number) determines LastName, and LastName determines RepNum. It is important for the database developer to ask about the future, and to imagine what can and may happen. Hiring a new sales person with the same last name as a current sales rep shows that RepNum will determine LastName, but LastName will not determine any other column.

RepNum, in fact, determines every other field in the Rep table, since each sales person has a unique number. That is what row ID numbers are for. This makes RepNum a key for the Rep table. It is in fact the primary key for this table.

The rule for primary keys states that a column is the primary key if two things are true:

  • All columns are functionally dependent on the primary key
  • No subset of columns in the primary key functionally determines all other columns

Candidate keys are those columns that could be chosen as the primary key. It is possible that more than one candidate key exists in a table, for instance, I suggested above that a table that holds both employee ID numbers and Social Security numbers could use either column as a primary key. (Except for the little rule that you are not allowed to use a Social Security number and an identifier. Laws must be followed...) Any candidate key that is not chosen to be the primary key can be called an alternate key.

The definition of first normal form says that a table may not contain repeating groups. Another way of saying this is that cells in a table may only contain single values. A standard cure for repeating groups split the records with repeating groups into multiple records.

It may also be necessary to expand the primary key when you are trying to get rid of repeating groups. That's because the second problem with repeating groups is that the column representing the primary key does not determine the value of the field containing the repeating groups. This becomes obvious when you expand such a row, making it multiple rows containing the same value for the primary key. Let's see an example. Assume the following table:

Instructor Class Room
Vincent CIS107a
CIS331
S107
S102
Scott CIS302a
CIS331
D204
D204

The Class and Room columns contain repeating values. The Instructor column is meant to be the primary key, but it does not meet the rule. The DBMS can't tell which class meets in which room. We can fix this by expanding the primary key, adding a column for the second part of the new primary key, and splitting the values of the repeating groups into separate rows.

Instructor Day Class Room
Vincent Wednesday CIS107a S107
Vincent Thursday CIS331 S102
Scott Tuesday CIS302a D204
Scott Wednesday CIS331
D204

In the expanded table, the new primary key is the combination of the Instructor and Day columns. The combination determines the contents of the remaining columns. This table is now in first normal form. A possible problem is that it assumes that no instructor has more than one class per day. What would you have to do if that were not true?

Second normal form says that a table is in first normal form, and it has a new quality: No column that is not part of the primary key is determined by a subset of the primary key. Let's say that another way: if the primary key is made of more than one column, no column in the table can be determined by a mere portion of primary key. In the example above, the rule requires that Class and Room cannot be determined by Instructor or Day alone. This rule is met, so this table is in second normal form.

Note that this rule describes tables whose primary key is a combination of columns. Any table in first normal form that has a single column as its primary key is automatically in second normal form.

So, how do you find and fix tables that are not in 2NF? You examine tables for dependencies. Which columns are determined by which other columns? How much data in the table is repeated from one row to another? You do the same thing when imposing Third normal form, so let's think about it for a minute.

Third normal form lets us use another term from the text. A determinant is a column (or set of columns) that determines the contents of another column. A table is in third normal form if the only determinants in it are candidate keys. (Remember that the primary key is also a candidate key.) The procedure to cure a violation of this rule is simple: we copy the non-candidate determinant into a new table and move the columns that it determines to the new table. The non-candidate determinant becomes the primary key of the new table. It continues to exist in the old table, relating the two tables.

Problems are created when the correction procedures are not done "correctly". Note that when the solution is to decompose a table into two or more tables, each resultant table must be in 3NF when you are done.

Consider this example from (the now defunct) About.com:

CustNum FirstName LastName Address City State ZIP
1 John Doe 12 Main Street Sea Cliff NY 11579
2 Alan Johnson 82 Evergreen Tr Sea Cliff NY 11579
3 Beth Thompson 1912 NE 1st St Miami FL 33157
4 Jacob Smith 142 Irish Way South Bend IN 46637
5 Sue Ryan 412 NE 1st St Miami FL 33157

In this example, there is repeated data that could be a problem when updating the database. Assume that we have lots of customers in each of a dozen cities. There is the potential for entering an error each time the same data element is entered (e.g. city, state, or ZIP). We can simplify data entry by removing the columns that are dependent on a single nonkey column and creating a new table. Since ZIP determines City and State, we remove the City and State columns from the table, and create a new table containing ZIP, City, and State.

ZIP City State
11579 Sea Cliff NY
33157 Miami FL
46637 South Bend IN

This allows us to simplify the original table:

CustNum FirstName LastName Address ZIP
1 John Doe 12 Main Street 11579
2 Alan Johnson 82 Evergreen Tr 11579
3 Beth Thompson 1912 NE 1st St 33157
4 Jacob Smith 142 Irish Way 46637
5 Sue Ryan 412 NE 1st St 33157

The two resultant tables are related on the ZIP column. (Using reflective color theory, I have related the blue table above to the yellow table below on the green column.)

When we place our data tables in 3NF, we avoid some classic problems found in tables that are not in first, second, or third normal form:

When data from a table like Customers is represented in other tables by the CustNum identifier field, we only have to update other attributes of a Customer in one place, as opposed to many places. This simplifies updating problems.

Keeping all data about an entity in one short descriptive table and referencing that table by its key attribute in other tables also avoids inconsistent storage of the data, which is possible when the same value must be entered in many places.

Assume we did not have a separate table for sales reps, but stored the Sales Rep data in the Customer table. We could not add a new sales rep to the database without assigning that sales rep to one or more customers. If all the customers of an existing employee were reassigned, we would lose all our data about that sales person. In this respect, having separate tables for customers and sales reps allows us to add and delete from either, and allows for the possibilities of a customer without a rep and a rep without any customers.

The discussion of fourth normal form requires another term. If each value of column A has several associated values of column B, then B is multidependent on A. This is also called a multivalued dependency. Fourth normal form requires that we have third normal form, and no multivalued dependencies.

To avoid the multidependency problem, the text shows us a new method for attaining 1NF. In the first examples, we removed repeating groups and expanded the primary key. That will not work if we have multiple repeating groups. In the text example, we have a Faculty table whose primary key is FacultyNum. It includes FacultyName, and two repeating groups: StudentNum, StudentName, and CommitteeCode, CommitteeName. The idea is that each faculty member will belong to several committees, and will be the adviser for several students. In addition, a student may have several advisers due to multiple majors.

The method to resolve the problems of such a table is to spin off several tables. Solution: remove each repeating group to a new separate table, copy the primary key of the original table to the new table, and establish a new primary key for that table consisting of the copied key column and the key for the repeating group.

This results in simpler tables:
Faculty (FacultyNum, FacultyName)
FacStudent (FacultyNum, StudentNum, StudentName)
FacCommittee (FacultyNum, CommitteeCode, CommitteeName)