Concepts:Chapter 2 The chapter begins with some ideas about models. The first ones to consider are three general definitions:
Models are used as planning tools, and as tools to confirm the understanding we have about the business whose data our system is supposed to model. We confirm our understanding by discussing it with subject matter experts, usually the people who actually capture or use the data. The text reviews some of the database terminology we talked about last week, and adds some thoughts about relationships between entities. Relationships are characterized three ways:
Knowing the kind of relationship that instances (of one entity) have with instances (of another entity) can lead you to apply rules about whether a field in a database must be filled in, or is allowed to be blank or empty. When it must be filled in, a rule is needed that specifies what kind of data must go in that field. This is one kind of constraint, a requirement that helps keep the database in order. Constraints often are created to accommodate business rules. The text offers some advice about what you should look for when you write business rules.
The text offers a list of people in a company who might be good sources of information about business rules. We should observe that such information must be confirmed until a consensus is reached about what the rule really is, as opposed to what it might have been meant to be. On the same subject, the text offers some reasons for creating a set of business rules for the client. It may be necessary to point out to some clients that we need the business rules to create a system that supports them. We may also find ourselves in a situation in which there is no single rule about a number of business practices. If this is the case, we have an opportunity for the client to establish common practices in their organization. Note the guidelines about building business rules into our database system:
The text tosses in four concepts that relate to all databases:
The text changes direction and begins a section on the database type
you are most likely to encounter, the Relational
Database. A relational database can be considered as a collection of data stored
in tables. The tables, which must meet some strict rules, are
called relations, hence the name relational database. The
information about each entity in this kind of database is stored
in a table dedicated to that entity. The attributes of that
entity are represented as columns in the table, and the rows
represent unique instances (cases) of the entity. Rows
are also called tuples, but only
by people being terribly pedantic. In other words, in a table about students
(an entity), each column represents a characteristic common to students
(an attribute), and each row represents an individual student (an instance
of the entity).
Rows can also be called records or tuples. (No one likes that last word.) Columns represent characteristics, also called attributes or fields. Relationships between entities are created by the same columns appearing, logically, in two or more tables. You can't just copy a column from one table to another, it has to belong in that table for a reason. I will explain which column it should be in a few more lines. For the moment, trust me about this example: Our student table could be linked to a table of majors
by having a very significant column from the Majors table in each table.
In the majors table, each major would appear once. In the student table,
each major might appear any number of times, because many students have
the same major. This puts majors in a one-to-many relationship to students,
regardless of the fact that our student table has only two students in
it at this time.
In the example above, the two relations are linked by their shared attribute: Major. Note that there are three majors shown, and two students. There is no requirement that every major in the Majors table has to appear in the Student table, only that we could use the entry in the Student table to learn details about the major that a student happens to have, once we look it up in the Majors table. The DBMS can be told to do that in a query. You should know the formal definition of a relation, which has several parts. A relation is a two dimensional table that meets these rules:
The rules taken together are the definition of a table being in First Normal Form (1NF), which makes it possible to use that table in a relational database. The statements about row and column order being unimportant do not mean that you can't sort the table. They mean that sorting the table does not affect the data stored in the table. If you sort a table, you sort entire rows, so the information in any row stays with the rest of the information belonging to that row. Think of it like this: a row is a record. You want each record to stay intact when the table is sorted. Would you want information from someone else's record swapped with your record? (No, you wouldn't.) If a relation does not meet the first rule, that relation is unnormalized. There are more rules about normalization, but this one has to be met before the others become important. Changing topics again, the text spends some pages on several database
models. The next one is the Entity Relationship
Model. In another class, you might have encountered Entity Relationship
Diagrams (ERDs), which are the main attraction of this model. The text
shows us a notation chart that really only shows how to denote whether
entities are in One-to-One, One-to-Many, or Many-to-Many relationships
in three standard notations. It would be best to know that they exist
and that you may need to switch from one to another when you work for
different companies. You can use ERDs regardless of the kind of database
you are using. The next model is the Object-Oriented
Data Model (OODM), also
called the Semantic Data Model.
In this model, objects contain data, attributes, and information about
relations to other objects. They also contain Information about operations
(processing) that takes place on the data they contain. This model is
based on object oriented programming. In that kind of programming, you
have classes and subclasses.
The text also mentions the Extended Relational Data Model (ERDM), but not much is said about it. More is said about Extensible Markup Language (XML), which has already been mentioned, and whose acronym does not make much sense. Several newer technologies are mentioned as responses to the concept called Big Data, which is a catch-all concept for projects that deal with large amounts of varied, unstructured data. The technologies mentioned are Hadoop, Hadoop Distributed File System (HDFS), MapReduce, and NoSQL. The first two are based on a search engine, the third based on a combination of a search function and a summary function, and the fourth is based on the idea that some of your data may be stored in relations, and some may not be stored in relations. The text says a bit more about this concept. NoSQL may stand for non-SQL or for Not Only SQL. It supports relational data, and also data that is not stored in relations. Because it does not require that data be stored in normalized tables, it can be used with collections of data in many other formats, which describes much of the data on the Internet. The text offers an example of a table that is meant to be a relation, but it only has some of its cells filled in. When stored in a NoSQL key-value table, that table only holds the information that is known about the instances in the original table, and does not hold any cells that would have been blank in the intended relation. You may see how this would make a search faster, but it would also leave out the fact that some of the data elements are missing. The text spends several pages summarizing the features of the models it has discussed, offering a chart that describes their conceptual natures.
In the list above, we can also say that models at the top of the list have little data abstraction, and models at at the bottom of the list have high data abstraction. Abstraction is defined in this chapter as being "only a model", representing a real data structure or relationship but only in a way that can be understood or summarized easily. Chapter 3 Having given you an overview of database methods and trends last week in chapter 1, chapter 3 returns to relational databases, which are the main kind you will deal with in the real world. The chapter presents a list of features that (mostly) define a relation:
Keys are very important to relations.
One of the more formal rules for relations states that rows
must be unique in a table. Keys are used to identify them as being
unique. An example of a key field would be a field that holds a number
that is unique for each row. Key fields are also used to link
one relation to another, as noted in chapter 2. Although a key is required, there is no requirement that there must be
only one key in a relation. A relation may have several
fields that might be used as that relation's primary
key, the field that is chosen as the one that is to be used as
its unique identifier. In fact, sometimes it is necessary to use a combination
of several fields as the key. In those cases, the key is also called a
composite key. The rule for primary keys states that a column is the primary key if two things are true:
This means that a primary key must determine all other columns. The key will take you to a unique set of values among all the rows in that table. It also means that if the primary key consists of more than one column, no portion of the primary key may act as a primary key. If the subset portion of the rule is violated, it means that you have not identified the correct primary key. Candidate keys are those columns that could be chosen as the primary key. It is possible that more than one candidate key exists in a table. For instance, a table that holds both employee ID numbers and Social Security numbers could use either column as a primary key. Any candidate key that is not chosen to be the primary key is designated an alternate key. So, important terms from that section:
Because it is an important issue, most DBMSs will allow you to set constraints
(rules) about particular fields. For instance, the NOT
NULL constraint requires that a field must have a value in it.
This is not the same thing as the UNIQUE
constraint, which requires that the value in a field must not duplicate
any other value in the same field in that file. However, since null entries
are all the same, applying both constraints to a field that is the primary
key for a table would make sense. The chapter is about to get involved in actual SQL commands, so pay attention. The text introduces the concept of relational algebra, a set of operations performed with commands that examine and manipulate relations. The commands are called operators, but only in formal discussions. Performing a relational operation on one or more relations will produce one or more new relations. These new relations can be saved as files, but they don't have to be. The text refers to the creation of new relations from old ones as giving closure to relational operators. In English, this makes no sense, but we have to accept it as a required definition. The text introduces some relational operators:
Some examples are given of the results these operators would produce
on some example tables. Note that the examples in the text do NOT use
correct SQL syntax. They are meant to be read as hybrids of the correct
syntax and a user request for the information. Relational models and normalizationA large part of the assignment for this week requires you to have knowledge about data normalization. The text does not appear to spend any pages on general database design and normalization. Instead of hoping you have seen it before, here are some thoughts about those ideas from another text: Normalization is the process of making your data tables comply with several rules. Compliance with each successive rule takes the tables to the next level of normalization. The levels are called normal forms, specifically first, second, third, and fourth (also called Boyce-Codd) normal form. The forms can also be referred to by the symbols 1NF, 2NF, 3NF, and 4NF (BCNF). The first objective is to define functional dependence. A column is functionally dependent on another column if the value found in the second column determines the value found in the first column. Suppose we had a database about customers who live in either Canada or the United States. Let's have a column in the customer table called State/Province, and another column called Country. If the value in State/Province for a customer is Michigan, the value in Country for that customer will be United States. If the value in State/Province is Ontario, the value in Country will be Canada. So the Country column is functionally dependent on State/Province. Likewise, the State/Province column functionally determines Country. Note that such a relationship does not have to flow both ways: knowing a customer's country of residence will not determine their state/province of residence. Let's consider a problem that database developers can find on their screens. In small tables, it is tempting to find dependencies that may not be real. If we have only three sales people in a Rep table, we could conclude that RepNum (the sales rep's ID number) determines LastName, and LastName determines RepNum. It is important for the database developer to ask about the future, and to imagine what can and may happen. Hiring a new sales person with the same last name as a current sales rep shows that RepNum will determine LastName, but LastName will not determine any other column. RepNum, in fact, determines every other field in the Rep table, since each sales person has a unique number. That is what row ID numbers are for. This makes RepNum a key for the Rep table. It is in fact the primary key for this table. The rule for primary keys states that a column is the primary key if two things are true:
Candidate keys are those columns that could be chosen as the primary key. It is possible that more than one candidate key exists in a table, for instance, I suggested above that a table that holds both employee ID numbers and Social Security numbers could use either column as a primary key. (Except for the little rule that you are not allowed to use a Social Security number and an identifier. Laws must be followed...) Any candidate key that is not chosen to be the primary key can be called an alternate key. The definition of first normal form says that a table may not contain repeating groups. Another way of saying this is that cells in a table may only contain single values. A standard cure for repeating groups split the records with repeating groups into multiple records. It may also be necessary to expand the primary key when you are trying to get rid of repeating groups. That's because the second problem with repeating groups is that the column representing the primary key does not determine the value of the field containing the repeating groups. This becomes obvious when you expand such a row, making it multiple rows containing the same value for the primary key. Let's see an example. Assume the following table:
The Class and Room columns contain repeating values. The Instructor column is meant to be the primary key, but it does not meet the rule. The DBMS can't tell which class meets in which room. We can fix this by expanding the primary key, adding a column for the second part of the new primary key, and splitting the values of the repeating groups into separate rows.
In the expanded table, the new primary key is the combination of the Instructor and Day columns. The combination determines the contents of the remaining columns. This table is now in first normal form. A possible problem is that it assumes that no instructor has more than one class per day. What would you have to do if that were not true? Second normal form says that a table is in first normal form, and it has a new quality: No column that is not part of the primary key is determined by a subset of the primary key. Let's say that another way: if the primary key is made of more than one column, no column in the table can be determined by a mere portion of primary key. In the example above, the rule requires that Class and Room cannot be determined by Instructor or Day alone. This rule is met, so this table is in second normal form. Note that this rule describes tables whose primary key is a combination of columns. Any table in first normal form that has a single column as its primary key is automatically in second normal form. So, how do you find and fix tables that are not in 2NF? You examine tables for dependencies. Which columns are determined by which other columns? How much data in the table is repeated from one row to another? You do the same thing when imposing Third normal form, so let's think about it for a minute. Third normal form lets us use another term from the text. A determinant is a column (or set of columns) that determines the contents of another column. A table is in third normal form if the only determinants in it are candidate keys. (Remember that the primary key is also a candidate key.) The procedure to cure a violation of this rule is simple: we copy the non-candidate determinant into a new table and move the columns that it determines to the new table. The non-candidate determinant becomes the primary key of the new table. It continues to exist in the old table, relating the two tables. Problems are created when the correction procedures are not done "correctly". Note that when the solution is to decompose a table into two or more tables, each resultant table must be in 3NF when you are done. Consider this example from (the now defunct) About.com:
In this example, there is repeated data that could be a problem when updating the database. Assume that we have lots of customers in each of a dozen cities. There is the potential for entering an error each time the same data element is entered (e.g. city, state, or ZIP). We can simplify data entry by removing the columns that are dependent on a single nonkey column and creating a new table. Since ZIP determines City and State, we remove the City and State columns from the table, and create a new table containing ZIP, City, and State.
This allows us to simplify the original table:
The two resultant tables are related on the ZIP column. (Using reflective color theory, I have related the blue table above to the yellow table below on the green column.) When we place our data tables in 3NF, we avoid some classic problems found in tables that are not in first, second, or third normal form: When data from a table like Customers is represented in other tables by the CustNum identifier field, we only have to update other attributes of a Customer in one place, as opposed to many places. This simplifies updating problems. Keeping all data about an entity in one short descriptive table and referencing that table by its key attribute in other tables also avoids inconsistent storage of the data, which is possible when the same value must be entered in many places. Assume we did not have a separate table for sales reps, but stored the Sales Rep data in the Customer table. We could not add a new sales rep to the database without assigning that sales rep to one or more customers. If all the customers of an existing employee were reassigned, we would lose all our data about that sales person. In this respect, having separate tables for customers and sales reps allows us to add and delete from either, and allows for the possibilities of a customer without a rep and a rep without any customers. The discussion of fourth normal form requires another term. If each value of column A has several associated values of column B, then B is multidependent on A. This is also called a multivalued dependency. Fourth normal form requires that we have third normal form, and no multivalued dependencies. To avoid the multidependency problem, the text shows us a new method for attaining 1NF. In the first examples, we removed repeating groups and expanded the primary key. That will not work if we have multiple repeating groups. In the text example, we have a Faculty table whose primary key is FacultyNum. It includes FacultyName, and two repeating groups: StudentNum, StudentName, and CommitteeCode, CommitteeName. The idea is that each faculty member will belong to several committees, and will be the adviser for several students. In addition, a student may have several advisers due to multiple majors. The method to resolve the problems of such a table is to spin off several tables. Solution: remove each repeating group to a new separate table, copy the primary key of the original table to the new table, and establish a new primary key for that table consisting of the copied key column and the key for the repeating group. This results in simpler tables:
|