CIS 251 - System Development Methods

Chapter 9: Data Design
Chapter 10: System Architecture

Objectives:

This lesson discusses material from chapters 9 and 10. Objectives important to this lesson:

  1. Data design terms
  2. Data relationships
  3. Normalization
  4. Thoughts on networking
  5. Thin or thick client
  6. Assignments for the week

Concepts:

Chapter nine begins with a discussion on page 348 about data structures and the choice between file-oriented systems and relational database systems.

The text offers an example of a file-oriented system: a credit company might keep an ongoing file of transactions completed by its customers, perhaps storing a separate file for each day. There could be a batch process that runs "after hours" that aggregates the transactions for each account, and updates a customer file with the total change for each account. This system could work if there is such a thing as "after hours", which is unlikely in a world that offers goods for sale on the Internet. It would also require that we would be satisfied with updating the customers' records at the end of a chosen time period instead of completing live updates as the transactions are made.

The text proposes that this older system approach could support a company that chose to or needed to keep the records of each division of the company separate from the others. This compartmentalization presents a problem: redundant data elements probably exist, in that each division would need to keep records on every customer. This is wasteful in terms of data storage needs and costs. It also leads to data integrity problems: an update to a customer's record (change in name, address, balance, etc.) would need to be made by each division if separate file stores exist. It is quite likely that the data will be out of date or otherwise incorrect unless each division is notified when any division becomes aware of a request to change stored data. This invites inconsistent data into our system. The text is also critical of the fact that each data file is likely to be designed differently from the others, which would lead to difficulty in retrieving data from more than one file in a single query.

These are good reasons for the data industry to have moved on to more modern, flexible, and resilient data structures. It makes sense to move on to the kind of system you are more likely to build and support.

On page 350, the text lists several arguments about why database management systems (DBMSs) are more useful than file-oriented systems. This list may be easier to understand:

  • Getting more information from the same amount of data - You need to understand what the author means by those two terms. Data may be considered to be raw facts. Information is what we get when we process data to gain knowledge. A DBMS may help us see a relation between data elements that we did not think were related.
  • Sharing data - A DBMS will allow more users to access the stored data than a spreadsheet system can.
  • Balancing conflicting requirements - One of the jobs of a DBA (Database Administrator) is to impose structure on the database so that the company can get the most use out of it. Serving the needs of the greatest number of users (the company as a whole) is usually the goal.
  • Controlling redundancy - Redundancy means saving the same thing in several places. Saving a data element in only one place conserves storage space, and makes updating the system much simpler. Redundancy cannot be eliminated (which we will discuss shortly), but it can be reduced.
  • Facilitating consistency - Consistency is a direct result of reducing redundancy. When a data element is stored in fewer places, it is less likely to be stored incorrectly, or fail to be updated.
  • Improving integrity - Integrity constraints are rules for databases, such as a rule that a certain field must contain only numeric data, or that a field used in a query must contain a value that is already in the database (think of a drop-down list). DBMSs allow us to enforce integrity constraints. (A sketch of some common constraints appears after this list.)
  • Expanding security - Security can be the assignment of passwords for DBMS users, or the assignment of different rights to various users or groups.
  • Increasing productivity - In the old days, a programmer would have to write a program for every query a user wanted to run on a database. A DBMS will allow that programmer to work on something else, once the users understand how to use the DBMS.
  • Providing data independence - This means that we can change the structure of the database, and add functions to the DBMS, without having to reenter all the data.
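
To make the integrity point concrete, here is a minimal SQL sketch of the kinds of constraints a DBMS can enforce. The Product and Category tables and their columns are my illustration, not from the text:

-- A hypothetical catalog schema showing common integrity constraints.
CREATE TABLE Category (
    CategoryID  INTEGER PRIMARY KEY,
    Name        VARCHAR(40) NOT NULL            -- every category must be named
);

CREATE TABLE Product (
    ProductID   INTEGER PRIMARY KEY,
    Price       DECIMAL(8,2) NOT NULL
                CHECK (Price >= 0),             -- the field must be numeric and non-negative
    CategoryID  INTEGER NOT NULL
                REFERENCES Category(CategoryID) -- the value must already exist, like a drop-down list
);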

The text gives us a list of components found in most DBMSs, most of which are familiar:

  • user interface - used by people who use but do not control the system
    The text mentions two common kinds of user interface. The first is Query By Example (QBE), in which users enter data in fields to find records that match. In general, a query is a question that you ask the DBMS. Whatever DBMS you use, there will be a structured way to make a query. QBE allows you to make a query by filling in a grid or template.
    The second is any of several dialects of Structured Query Language (SQL), a text-based language used to query and manage a database. (A sample query appears after this list.)
  • administrator interface - used by people who create and maintain databases; will contain a larger command set, containing management commands that general users will not be allowed to use
  • related system interface - used for communication between two or more systems that are allowed to interface with each other
  • data manipulation language - an interface that provides specific data control to administrators, such as storing, backing up, and updating data
  • schema - The text has already described a data dictionary, which is a repository of information about all the data structures in a database. A schema includes these features, but is different because it also includes rules about these features that control what fields are allowed in certain views, what operations may be carried out by different users, and what users are allowed to do in general. A subschema further defines what the database may do for an individual user type or group.
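
To show the contrast between the two kinds of user interface, here is a hypothetical query written in SQL; a QBE user would express the same question by filling in criteria under the matching columns of a grid. The Customer table and its columns are assumptions for illustration:

-- Find Michigan customers with a balance over 500.
-- In QBE, you would type MI under State and >500 under Balance.
SELECT CustNum, LastName, Balance
FROM   Customer
WHERE  State = 'MI'
  AND  Balance > 500;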

The text also presents some thoughts about web-based data design. It should be an assumption that any commercial entity will want to have a presence on the Internet, and most non-commercial entities will want one as well.

Let's move ahead to page 356, where the text presents more details about databases. First we enjoy some basic terms.

  • entity - something that you want to store data about, such as customers, accounts, employees, inventory, places, or events
  • attribute - a characteristic of an entity; also called a field, column, or property
  • record - also called a tuple, but only by textbooks; a set of all the attributes of one instance in a data file: if the data file is about students (the entity), the record of a particular student (an instance of the entity) consists of all the information stored about that student
  • relationship - an association between entities in your database. Customers are associated with their sales reps, in that each customer has one and only one sales rep. This is a relationship. From the point of view of the sales rep, this is a one to many relationship, since each sales rep has many customers, but each customer has only one sales rep. Relationships can be one to one, one to many, many to one, or many to many. (A sketch of this relationship in SQL appears after this list.)
    The text discusses displaying these relationships as Entity Relationship Diagrams on pages 360 through 364.
  • data file - all data in a database is stored in files, including lists of entities, their attributes, and the relationships between entities
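
The customer/sales rep relationship described in the list above can be sketched in SQL. This is a minimal illustration with assumed names, not a schema from the text:

-- One-to-many: many Customer rows may point at the same SalesRep row,
-- but each Customer row points at exactly one rep.
CREATE TABLE SalesRep (
    RepNum    INTEGER PRIMARY KEY,
    LastName  VARCHAR(30)
);

CREATE TABLE Customer (
    CustNum   INTEGER PRIMARY KEY,
    RepNum    INTEGER NOT NULL REFERENCES SalesRep(RepNum)
);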

Normalization is the process of making your data tables comply with several rules. Compliance with each successive rule takes the tables to the next level of normalization. The levels are called normal forms, specifically first, second, third, and fourth normal form. The forms can also be referred to by the symbols 1NF, 2NF, 3NF, and 4NF.

The first objective for understanding normalization should be to define functional dependence.

  • A column is functionally dependent on another column if the value found in the second column determines the value found in the first column.

Suppose we had a database about customers who live in either Canada or the United States. Let's have a column in the customer table called State/Province, and another column called Country. In case you are not aware of it, all the states and provinces have unique names.

  • If the value in State/Province for a customer is Michigan, the value in Country for that customer will be United States. There is no province called Michigan in Canada.
  • If the value in State/Province is Ontario, the value in Country will be Canada. There is no state in the US called Ontario.

So the Country column is functionally dependent on State/Province. Knowing the value in the State/Province column allows us to predict the associated value in the Country column. We can also say that the State/Province column functionally determines Country. Note that the relationship does not have to flow both ways: knowing a customer's country of residence will not determine their state/province of residence. This situation makes State/Province a key field for this table.
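
Since a determinant can serve as a key, the State/Province-to-Country pairing could be stored once in a lookup table keyed on the determinant. A minimal sketch with assumed names (SQL identifiers cannot contain a slash, so the column is spelled StateProvince):

-- State/Province determines Country, so each pairing is stored exactly once.
CREATE TABLE StateProvince (
    StateProvince  VARCHAR(30) PRIMARY KEY,  -- the determinant is the key
    Country        VARCHAR(20) NOT NULL
);

INSERT INTO StateProvince VALUES ('Michigan', 'United States');
INSERT INTO StateProvince VALUES ('Ontario',  'Canada');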

  • A key field is a field that determines the value of at least one other field.
RepNum   LastName   FirstName
11223    Loman      Willie
22334    Hill       Harold
44556    Fate       Henry

In small tables, it is tempting to find dependencies that may not be real. Assume the table above of sales people with three fields: RepNum, LastName, and FirstName. If we have only three sales people in the Rep table, we could conclude that RepNum (a field listing employee numbers) determines LastName, or that LastName determines RepNum. The key is to ask about the future, and to imagine what can and may happen. Hiring a new sales person with the same last name as a current sales rep should tell us that RepNum will still determine LastName, but LastName will not determine any other column.

RepNum, in fact, determines every other field in the Rep table, since each sales person has a unique number. This makes RepNum a key for the Rep table. It is in fact the primary key for this table.

The rule for primary keys states that a column is the primary key if two things are true:

  • All columns are functionally dependent on the primary key
  • No subset of columns in the primary key functionally determines all other columns

This means that a primary key must determine all other columns. It is possible that you need a combination of two columns to define a primary key.

  • If the primary key consists of more than one column, no portion of the primary key may act as a primary key on its own.
  • If the subset portion of the rule is violated, it means that you have not identified the correct primary key.

Candidate keys are those columns that could be chosen as the primary key. It is possible that more than one candidate key exists in a table, for instance, a table that holds both employee ID numbers and Social Security numbers could use either column as a primary key. Any candidate key that is not chosen to be the primary key is designated an alternate key.
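
A minimal sketch of a table with two candidate keys, using assumed names. Whichever candidate is not chosen as the primary key can still be protected with a UNIQUE constraint:

-- EmpID is chosen as the primary key; SSN becomes the alternate key.
CREATE TABLE Employee (
    EmpID  INTEGER PRIMARY KEY,
    SSN    CHAR(11) NOT NULL UNIQUE,  -- still uniquely identifies a row
    Name   VARCHAR(40) NOT NULL
);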

Okay, with that understood, let's talk about normalization. The definition of first normal form says that a table may not contain repeating groups. This means that cells in a table may only contain single values. A standard cure for repeating groups is to split the records with repeating groups into multiple records. It may also be necessary to expand the primary key (include another column in the key).

Another problem with repeating groups is that the column intended to be the primary key cannot determine the value of the field containing the repeating group. This becomes obvious when you expand such a row into multiple rows containing the same value for the primary key. Consider the table below:

Instructor   Class     Room
Adams        CIS107a   S107
             CIS331    S102
Jefferson    CIS302a   D204
             CIS331    D204

The Class and Room columns contain repeating values. The Instructor column is meant to be the primary key, but it does not meet the rule. We can fix this by expanding the primary key, adding a column for the second part of the new primary key, and splitting the values of the repeating groups into separate rows.

Instructor   Day         Class     Room
Adams        Wednesday   CIS107a   S107
Adams        Thursday    CIS331    S102
Jefferson    Tuesday     CIS302a   D204
Jefferson    Wednesday   CIS331    D204

In the expanded table, the new primary key is the combination of the Instructor and Day columns. The combination determines the contents of the remaining columns. This table is now in first normal form.
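
Expressed in SQL, the first-normal-form version of the table might look like the sketch below. The table name and column types are my assumptions:

-- Every cell holds a single value, and the composite key
-- (Instructor, Day) determines Class and Room.
CREATE TABLE Schedule (
    Instructor  VARCHAR(30),
    Day         VARCHAR(10),
    Class       VARCHAR(10) NOT NULL,
    Room        VARCHAR(10) NOT NULL,
    PRIMARY KEY (Instructor, Day)
);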

A table is in second normal form if it is in first normal form and has an additional quality:

  • No column that is not part of the primary key is determined by a subset of the primary key.

Let's say that another way: if the primary key is made of more than one column, the non-key columns in the table cannot be determined by a mere portion of the primary key. In the example above, the rule requires that Class and Room cannot be determined by Instructor or Day alone. This rule is met, so the table is in second normal form.

Note that since this rule describes tables whose primary key is a combination of columns, any table in first normal form that has a single column as its primary key is automatically in second normal form.

So, how do you find and fix tables that are not in 2NF? You examine tables for dependencies. Which columns are determined by which other columns? How much data in the table is repeated from one row to another? You use the same techniques when imposing third normal form, so let's think about it for a minute.
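
The schedule table above happens to satisfy 2NF, so here is a hypothetical violation and its repair (the OrderLine and Part tables are my illustration, not the text's). Suppose OrderLine had the composite key (OrderNum, PartNum) but also stored PartDescription, which depends on PartNum alone. The cure is to move the partially dependent column into its own table:

-- Before: OrderLine(OrderNum, PartNum, PartDescription, Quantity)
-- PartDescription depends on only part of the key, violating 2NF.

CREATE TABLE Part (
    PartNum          INTEGER PRIMARY KEY,
    PartDescription  VARCHAR(40) NOT NULL    -- now stored once per part
);

CREATE TABLE OrderLine (
    OrderNum  INTEGER,
    PartNum   INTEGER REFERENCES Part(PartNum),
    Quantity  INTEGER NOT NULL,
    PRIMARY KEY (OrderNum, PartNum)          -- every non-key column needs the whole key
);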

Third normal form lets us use another term from the text.

  • A determinant is a column (or set of columns) that determines the contents of another column.
  • A table is in third normal form if the only determinants in it are candidate keys. (Remember that the primary key is also a candidate key, so that's fine.)

The procedure to cure a violation of this rule is simple: we copy the non-candidate determinant into a new table and move the columns that it determines to the new table. The non-candidate determinant becomes the primary key of the new table. It continues to exist in the old table, relating the two tables.

Problems are created when the correction procedures are not done correctly. Students should examine the examples in the text. Note that when the solution is to decompose a table into two or more tables, each resultant table must be in 3NF when you are done.

Consider this example from About.com:

CustNum   FirstName   LastName   Address          City         State   ZIP
1         John        Doe        12 Main Street   Sea Cliff    NY      11579
2         Alan        Johnson    82 Evergreen Tr  Sea Cliff    NY      11579
3         Beth        Thompson   1912 NE 1st St   Miami        FL      33157
4         Jacob       Smith      142 Irish Way    South Bend   IN      46637
5         Sue         Ryan       412 NE 1st St    Miami        FL      33157

Some of the same address information is saved for multiple customers. That could be a problem when updating the database. Assume that we have lots of customers in each of a dozen cities. There is the potential for entering an error each time the same data element is entered. We can simplify data entry by removing the columns that are dependent on a single nonkey column and creating a new table. Since ZIP determines City and State, we remove the City and State columns from the table, and create a new table containing ZIP, City, and State.

ZIP     City         State
11579   Sea Cliff    NY
33157   Miami        FL
46637   South Bend   IN

This allows us to simplify the original table:

CustNum   FirstName   LastName   Address          ZIP
1         John        Doe        12 Main Street   11579
2         Alan        Johnson    82 Evergreen Tr  11579
3         Beth        Thompson   1912 NE 1st St   33157
4         Jacob       Smith      142 Irish Way    46637
5         Sue         Ryan       412 NE 1st St    33157

The two resultant tables are related on the shared ZIP column. When we need the City and State information for a customer, we look it up in the ZIP table, where each combination is stored only once.
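
A minimal SQL sketch of this decomposition and lookup, assuming the table names ZipCode and Customer:

-- ZIP determines City and State, so they are stored once per ZIP.
CREATE TABLE ZipCode (
    ZIP    CHAR(5) PRIMARY KEY,
    City   VARCHAR(30) NOT NULL,
    State  CHAR(2) NOT NULL
);

CREATE TABLE Customer (
    CustNum    INTEGER PRIMARY KEY,
    FirstName  VARCHAR(20),
    LastName   VARCHAR(20),
    Address    VARCHAR(40),
    ZIP        CHAR(5) REFERENCES ZipCode(ZIP)  -- relates the two tables
);

-- Look up the city and state for each customer.
SELECT c.CustNum, c.LastName, z.City, z.State
FROM   Customer c
JOIN   ZipCode  z ON z.ZIP = c.ZIP;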

When we place our data tables in 3NF, we avoid some classic problems found in tables that are not in first, second, or third normal form:

  • When data from a table like Sales Reps is represented in other tables by the Sales Rep number, we only have to update the other attributes of a Sales Rep in one place, as opposed to many places. This simplifies updates.
  • Keeping all data about an entity in one short descriptive table and referencing that table by its key attribute in other tables also avoids inconsistent storage of the data, which is possible when the same value must be entered in many places.
  • Assume we did not have a separate table for sales reps, but stored the Sales Rep data in the Customer table. We could not add a new sales rep to the database without assigning that sales rep to one or more customers. If all the customers of an existing employee were reassigned, we would lose all our data about that sales person. In this respect, having separate tables for customers and sales reps allows us to add and delete from either, and allows for the possibilities of a customer without a rep and a rep without any customers.

Chapter 10 reads like a lesson on networking. The main material begins on page 425. Much of it is outside the scope of this class, despite the authors' efforts to include it.

The text makes the point that some network planning decisions should be made at high levels. It should not be left to branch offices to decide whether or not to use client server networks, whether to use the same software as the rest of the enterprise, or whether to connect their offices to each other in the same way that has been used throughout the rest of the enterprise. Decisions like these should be made for the entire enterprise, as should decisions on upgrades to such components.

Much of the material in this chapter should have been considered as constraints in your early planning and analysis phases. If your system design did not already account for the network infrastructure in which it must exist, your plan is in trouble.

It may be appropriate to consider whether you will use a thin or thick client design. Thin clients put most of the burden on a server, presenting only an interface on the user's workstation, and performing the real computing in a session on a server. Thick clients (also called fat clients) put much of the processing on the workstation, and mainly save data on a server. Some questions you must ask to make this choice:

  • What kind of hardware is already in place for the workstations and servers? Does either limit our choices?
  • What choices do we have for expanding the capacity of either workstations or servers? New computers? Upgrades?
  • How robust is our network? Can we expect a thin client to work in our network, or must processing be done locally?
  • Can the network throughput be improved?
  • Will this system be run through the Internet? Can our present web servers meet our needs?

The text spends several pages on network topology and protocols. This material is more properly covered in a class about networking.