CIS 251 - System Development Methods
Chapter 9: Data Design
Chapter 10: System Architecture
Objectives:
This lesson discusses material from chapters 9 and 10. Objectives
important to this lesson:
- Data design terms
- Data relationships
- Normalization
- Thoughts on networking
- Thin or thick client
- Assignments for the week
Concepts:
Chapter nine begins with a discussion about data structures,
and making a choice between file-oriented systems and
relational database systems.
The text offers an example of a file-oriented system:
a credit company might keep an ongoing file of transactions
completed by its customers, perhaps storing a separate file for each
day. There could be a batch process that runs "after
hours" that aggregates the transactions for each account, and updates
a customer file with the total change for each account.
This system could work if there is such a thing as "after hours",
which is unlikely in a world that offers goods for sale on the Internet.
It would also require that we would be satisfied with updating the customers'
records at the end of a chosen time period instead of completing live
updates as the transactions are made.
The text proposes that this older system approach could support a company
that chose to or needed to keep the records of each division of the
company separate from the others. This compartmentalization
presents a problem: redundant data elements probably
exist, in that each division would need to keep records on every customer.
This is wasteful in terms of data storage needs and
costs. It also leads to data integrity problems: an
update to a customer's record (change in name, address, balance, etc.)
would need to be made by each division if separate
file stores exist. It is quite likely that the data will be out of date
or otherwise incorrect unless each division is notified when any division
becomes aware of a request to change stored data. This invites inconsistent
data into our system. The text is also critical of the fact that each
data file is likely to be designed differently from
the others, which would lead to difficulty in retrieving
data from more than one file in a single query.
These are good reasons for the data industry to have moved on to more
modern, flexible, and resilient data structures. It makes sense to move
on to the kind of system you are more likely to build and support.
The text lists several arguments about why database management
systems (DBMSs) are more useful than file-oriented systems.
The list below may be easier to understand:
- Getting more information from the same
amount of data - You need to understand
what the author means by those two terms. Data may
be considered to be raw facts. Information is
what we get when we process data to gain knowledge. A DBMS may help
us see a relation between data elements that we did not think were
related.
- Sharing data - A DBMS will allow more users
to access the stored data than a spreadsheet system can.
- Balancing conflicting requirements - One of
the jobs of a DBA (Database Administrator) is to impose structure on
the database so that the company can get the most use out of it. Serving
the needs of the greatest number of users (the company as a whole)
is usually the goal.
- Controlling redundancy - Redundancy means
saving the same thing in several places. Saving a data element in
only one place conserves storage space, and makes updating the system
much simpler. Redundancy cannot be eliminated (which we will discuss
shortly), but it can be reduced.
- Facilitating consistency -
Consistency is a direct result of reducing redundancy. When a data
element is stored in fewer places, it is less likely to be stored
incorrectly, or fail to be updated.
- Improving integrity - Integrity
constraints are rules for
databases, such as a rule that a certain field must contain only numeric
data, or that a field used in a query must contain a value that already
exists in the database (think of a drop-down list). DBMSs let us declare
and enforce integrity constraints (see the sketch after this list).
- Expanding security - Security can be the assignment
of passwords for DBMS users, or the assignment of different rights
to various users or groups.
- Increasing productivity - In the old days,
a programmer would have to write a program for every query a user
wanted to run on a database. A DBMS will allow that programmer to
work on something else, once the users understand how to use the DBMS.
- Providing data independence - This means that
we can change the structure of the database, and add functions to
the DBMS, without having to reenter all the data.
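To make the integrity and redundancy points concrete, here is a minimal
sketch using Python's built-in sqlite3 module. The table, columns, and
rule are my own illustration, not an example from the text.

```python
import sqlite3

# Declare integrity constraints on a table; the DBMS then enforces them.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Customer (
        CustNum INTEGER PRIMARY KEY,       -- must be unique
        Name    TEXT NOT NULL,             -- must be present
        Balance REAL CHECK (Balance >= 0)  -- must be numeric and non-negative
    )
""")

cur.execute("INSERT INTO Customer VALUES (1, 'Willie Loman', 100.0)")

# This row violates the CHECK constraint, so the DBMS rejects it.
try:
    cur.execute("INSERT INTO Customer VALUES (2, 'Harold Hill', -50.0)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```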
The text gives us a list of components found in most DBMSs, most of
which are familiar:
- user interface - used by people who use but do
not control the system
The text mentions two common kinds of user interface. The first is
Query By Example (QBE), in which
users enter data in fields to find records that match. In general,
a query is a question that
you ask the DBMS. Whatever DBMS you use, there will be a structured
way to make a query. QBE allows you to make a query by filling in
a grid or template.
The second is any of several forms of Structured Query Language
(SQL), a command-line language used
to query and manage a database. (A short query example follows this
list of components.)
- administrator interface - used by people who create
and maintain databases; will contain a larger command set, containing
management commands that general users will not be allowed to use
- related system interface - used for communication
between two or more systems that are allowed to interface with each
other
- data manipulation language - an interface that
provides specific data control to administrators,
such as storing, backing up, and updating data
- schema - The text has already described a data
dictionary, which is a repository of information about all the data
structures in a database. A schema
includes these features, but is different because it also includes
rules about these features that control what fields
are allowed in certain views, what operations may be carried out by
different users, and what users are allowed to do in general. A subschema
further defines what the database may do for an individual user type
or group.
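As a concrete illustration of a query, here is a minimal sketch using
Python's built-in sqlite3 module; the table and data are invented for
the example. A QBE interface would build an equivalent query from a
filled-in grid, while SQL states it directly as a command.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Rep (RepNum INTEGER, LastName TEXT, FirstName TEXT)")
cur.executemany("INSERT INTO Rep VALUES (?, ?, ?)",
                [(11223, "Loman", "Willie"), (22334, "Hill", "Harold")])

# The query: a question we ask the DBMS.
for row in cur.execute("SELECT RepNum, FirstName FROM Rep WHERE LastName = 'Hill'"):
    print(row)  # (22334, 'Harold')
```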
The text also presents some thoughts about web-based data design. It
should be an assumption that any commercial entity will want to have
a presence on the Internet, and most non-commercial entities will want
one as well.
Let's move ahead to more details about databases. First we enjoy some
basic terms:
- entity - something that you want to store
data about, such as customers, accounts, employees, inventory, places,
or events
- attribute - a characteristic of an entity;
also called a field, column,
or property
- record - also called a tuple, but only by textbooks;
a set of all the attributes of one instance
in a data file: if the data file is about students (the entity),
the record of a particular student (an instance of
the entity) consists of all the information stored about that student
- relationship - an association between entities
in your database. Customers are associated with their sales reps,
in that each customer has one and only one sales rep. From the point
of view of the sales rep, this is a one to many relationship,
since each sales rep has many customers, but each customer has only
one sales rep. Relationships can be one to one, one to many,
many to one, or many to many. (A sketch of a one to many relationship
follows this list of terms.)
The text discusses displaying these relationships as Entity
Relationship Diagrams on pages 360 through 364.
- data file - all data in a database is stored
in files, including lists of entities, their attributes, and the relationships
between entities
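Here is a minimal sketch of the one to many relationship described
above, again using sqlite3; the names are illustrative. Each customer
row carries the RepNum of its single sales rep, and one rep may appear
on many customer rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite3 enforces foreign keys only on request
cur = conn.cursor()

cur.execute("CREATE TABLE Rep (RepNum INTEGER PRIMARY KEY, LastName TEXT)")
cur.execute("""
    CREATE TABLE Customer (
        CustNum INTEGER PRIMARY KEY,
        Name    TEXT,
        RepNum  INTEGER REFERENCES Rep(RepNum)  -- the relationship
    )
""")
cur.execute("INSERT INTO Rep VALUES (11223, 'Loman')")
cur.executemany("INSERT INTO Customer VALUES (?, ?, ?)",
                [(1, "Acme", 11223), (2, "Globex", 11223)])  # many customers, one rep
```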
Normalization is the process of making your data
tables comply with several rules. Compliance with each successive rule
takes the tables to the next level of normalization. The levels are
called normal forms, specifically first,
second, third, and fourth normal form. The forms can also
be referred to by the symbols 1NF, 2NF, 3NF,
and 4NF.
The first step in understanding normalization is to define functional
dependence.
- A column is functionally dependent on another column
if the value found in the second column determines the
value found in the first column.
Suppose we had a database about customers who live in either Canada or
the United States. Let's have a column in the
customer table called State/Province, and another
column called Country. In case you are not aware
of it, all the states and provinces have unique names.
- If the value in State/Province for a
customer is Michigan, the value in Country for
that customer will be United States. There is
no province called Michigan in Canada.
- If the value in State/Province is Ontario,
the value in Country will be Canada.
There is no state in the US called Ontario.
So the Country column is functionally dependent on
State/Province. Knowing the value in the State/Province column allows
us to predict the associated value in the Country column. We can also
say that the State/Province column functionally determines Country.
Note that the relationship does not have to flow both ways: knowing
a customer's country of residence will not determine their state/province
of residence. This situation makes State/Province a key
field for this table.
- A key field is a field that determines
the value of at least one other field.
RepNum | LastName | FirstName
------ | -------- | ---------
11223  | Loman    | Willie
22334  | Hill     | Harold
44556  | Fate     | Henry
In small tables, it is tempting to find dependencies that may not be
real. Assume the table above of sales people with three fields: RepNum,
LastName, and FirstName. If we have only three sales people in the Rep
table, we could conclude that RepNum (a field listing employee numbers)
determines LastName, or that LastName determines RepNum. The key is
to ask about the future, and to imagine what can and may happen. Hiring
a new sales person with the same last name as a current sales rep should
tell us that RepNum will determine LastName, but LastName will not determine
any other column.
RepNum, in fact, determines every other field in the Rep table, since
each sales person has a unique number. This makes RepNum a key for
the Rep table. It is in fact the primary key for
this table.
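A functional dependence can be tested mechanically. The sketch below
(my own, not from the text) checks whether one column determines
another in a list of rows, and shows why the small Rep table can
mislead us.

```python
def determines(rows, a, b):
    """Return True if column a functionally determines column b."""
    seen = {}
    for row in rows:
        if row[a] in seen and seen[row[a]] != row[b]:
            return False  # one value of a paired with two values of b
        seen[row[a]] = row[b]
    return True

reps = [
    {"RepNum": 11223, "LastName": "Loman", "FirstName": "Willie"},
    {"RepNum": 22334, "LastName": "Hill",  "FirstName": "Harold"},
    {"RepNum": 44556, "LastName": "Fate",  "FirstName": "Henry"},
]
print(determines(reps, "RepNum", "LastName"))  # True
print(determines(reps, "LastName", "RepNum"))  # also True, but only because this
                                               # sample has no duplicate last names
```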
The rule for primary keys states that a
column is the primary key if two things are
true:
- All columns are functionally dependent on
the primary key
- No proper subset of the columns in the primary key functionally
determines all other columns
This means that a primary key must determine all other columns.
It is possible that you need a combination of two columns
to define a primary key.
- If the primary key consists of more than one column,
no portion of the primary key may act as a primary
key.
- If the subset portion of the rule is violated, it means that you
have not identified the correct primary key.
Candidate keys are those columns that could
be chosen as the primary key. It is possible for more than
one candidate key to exist in a table; for instance, a table that holds
both employee ID numbers and Social Security numbers could use either
column as a primary key. Any candidate key that is not chosen to be
the primary key is designated an alternate key.
Okay, with that understood, let's talk about normalization.
The definition of first normal form says
that a table may not contain repeating
groups. This means that cells in a table may only contain single
values. A standard cure for repeating groups
is to split the records with repeating groups
into multiple records. It may also be necessary to expand
the primary key (include another column in the key).
Another problem with repeating groups is that the
column representing the primary key cannot determine the value of the
field containing the repeating groups. This becomes obvious when you
expand such a row, making it multiple rows containing the same value
for the primary key. Consider the table below:
Instructor | Class           | Room
---------- | --------------- | ----------
Adams      | CIS107a, CIS331 | S107, S102
Jefferson  | CIS302a, CIS331 | D204, D204
The Class and Room columns contain repeating values. The Instructor
column is meant to be the primary key, but it does not meet the rule.
We can fix this by expanding the primary key, adding
a column for the second part of the new primary key, and splitting
the values of the repeating groups into separate rows.
Instructor | Day       | Class   | Room
---------- | --------- | ------- | ----
Adams      | Wednesday | CIS107a | S107
Adams      | Thursday  | CIS331  | S102
Jefferson  | Tuesday   | CIS302a | D204
Jefferson  | Wednesday | CIS331  | D204
In the expanded table, the new primary key is
the combination of the Instructor and Day columns.
The combination determines the contents of the remaining columns.
This table is now in first normal form.
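In SQL terms, the repaired table could be declared with a composite
primary key, as in this minimal sqlite3 sketch (the table name
Schedule is my own invention):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Schedule (
        Instructor TEXT,
        Day        TEXT,
        Class      TEXT,
        Room       TEXT,
        PRIMARY KEY (Instructor, Day)  -- two columns together form the key
    )
""")
cur.executemany("INSERT INTO Schedule VALUES (?, ?, ?, ?)", [
    ("Adams",     "Wednesday", "CIS107a", "S107"),
    ("Adams",     "Thursday",  "CIS331",  "S102"),
    ("Jefferson", "Tuesday",   "CIS302a", "D204"),
    ("Jefferson", "Wednesday", "CIS331",  "D204"),
])
```

Every cell now holds a single value, and the key determines the rest
of the row.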
A table is in second normal form if it is in first
normal form and has an additional quality:
- No column that is not part of the primary
key is determined by a subset of the primary
key.
Let's say that another way: if the primary key is made of more than
one column, the non-key columns in the table cannot be determined by
a mere portion of primary key. In the example above, the rule requires
that Class and Room cannot be determined by Instructor or Day alone.
This rule is met, so this table is in second normal form.
Note that since this rule describes tables whose primary key is a combination
of columns, any table in first normal form that has a single column
as its primary key is automatically in second normal form.
So, how do you find and fix tables that are not in 2NF? You examine
tables for dependencies. Which columns are determined
by which other columns? How much data in the table is repeated from
one row to another? You use the same techniques when imposing third
normal form, so let's think about it for a minute.
Third normal form requires another term from
the text:
- A determinant is a column (or set of
columns) that determines the contents of another column.
- A table is in third normal form if the only determinants
in it are candidate keys. (Remember that the primary key
is also a candidate key, so that's fine.)
The procedure to cure a violation of this rule is simple: we copy the
non-candidate determinant into a new table and move the
columns that it determines to the new table. The non-candidate determinant
becomes the primary key of the new table. It continues to exist in the
old table, relating the two tables.
Problems are created when the correction procedures are not done correctly.
Students should examine the examples in the text. Note that when the
solution is to decompose a table into two or more
tables, each resultant table must be in 3NF when you are done.
Consider this example from About.com:
CustNum | FirstName | LastName | Address         | City       | State | ZIP
------- | --------- | -------- | --------------- | ---------- | ----- | -----
1       | John      | Doe      | 12 Main Street  | Sea Cliff  | NY    | 11579
2       | Alan      | Johnson  | 82 Evergreen Tr | Sea Cliff  | NY    | 11579
3       | Beth      | Thompson | 1912 NE 1st St  | Miami      | FL    | 33157
4       | Jacob     | Smith    | 142 Irish Way   | South Bend | IN    | 46637
5       | Sue       | Ryan     | 412 NE 1st St   | Miami      | FL    | 33157
Some of the same address information is saved for multiple customers.
That could be a problem when updating the database. Assume that we have
lots of customers in each of a dozen cities. There is the potential
for entering an error each time the same data element is entered. We
can simplify data entry by removing the columns that are dependent
on a single nonkey column and creating a new table. Since ZIP
determines City and State, we remove the City and State
columns from the table, and create a new table containing ZIP, City,
and State.
ZIP   | City       | State
----- | ---------- | -----
11579 | Sea Cliff  | NY
33157 | Miami      | FL
46637 | South Bend | IN
This allows us to simplify the original table:
CustNum | FirstName | LastName | Address         | ZIP
------- | --------- | -------- | --------------- | -----
1       | John      | Doe      | 12 Main Street  | 11579
2       | Alan      | Johnson  | 82 Evergreen Tr | 11579
3       | Beth      | Thompson | 1912 NE 1st St  | 33157
4       | Jacob     | Smith    | 142 Irish Way   | 46637
5       | Sue       | Ryan     | 412 NE 1st St   | 33157
The two resultant tables are related on the ZIP column.
When we need the City and State information
for a customer, we look it up in the ZIP table, where each combination
is stored only once.
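A minimal sqlite3 sketch of the decomposition (the table names are my
own) shows how a join on ZIP recovers the full address when it is
needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ZipCodes (ZIP TEXT PRIMARY KEY, City TEXT, State TEXT)")
cur.execute("""
    CREATE TABLE Customer (
        CustNum   INTEGER PRIMARY KEY,
        FirstName TEXT,
        LastName  TEXT,
        Address   TEXT,
        ZIP       TEXT REFERENCES ZipCodes(ZIP)  -- relates the two tables
    )
""")
cur.executemany("INSERT INTO ZipCodes VALUES (?, ?, ?)",
                [("11579", "Sea Cliff", "NY"),
                 ("33157", "Miami", "FL"),
                 ("46637", "South Bend", "IN")])
cur.execute("INSERT INTO Customer VALUES (1, 'John', 'Doe', '12 Main Street', '11579')")

# City and State are stored once, and looked up through the ZIP column.
for row in cur.execute("""SELECT c.LastName, z.City, z.State
                          FROM Customer c JOIN ZipCodes z ON c.ZIP = z.ZIP"""):
    print(row)  # ('Doe', 'Sea Cliff', 'NY')
```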
When we place our data tables in 3NF, we avoid some classic problems
found in tables that are not in first, second, or third normal form:
- When data from a table like Sales Reps is represented in other tables
by the Sales Rep number, we only have to update other attributes of
a Sales Rep in one place, as opposed to many places. This simplifies
updating problems.
- Keeping all data about an entity in one short descriptive table
and referencing that table by its key attribute in other tables also
avoids inconsistent storage of the data, which is
possible when the same value must be entered in many places.
- Assume we did not have a separate table for sales reps, but stored
the Sales Rep data in the Customer table. We could not add a new sales
rep to the database without assigning that sales rep to one or more
customers. If all the customers of an existing employee were reassigned,
we would lose all our data about that sales person. In this respect,
having separate tables for customers and sales reps allows us to add
and delete from either, and allows for the possibilities of a customer
without a rep and a rep without any customers (see the sketch after
this list).
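The last point can be demonstrated with a small sqlite3 sketch (my own
names again): with separate tables, a rep can exist without customers,
and deleting or reassigning customers never destroys the rep's own
record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Rep (RepNum INTEGER PRIMARY KEY, LastName TEXT)")
cur.execute("CREATE TABLE Customer (CustNum INTEGER PRIMARY KEY, Name TEXT, RepNum INTEGER)")

cur.execute("INSERT INTO Rep VALUES (55667, 'NewHire')")         # a rep with no customers yet
cur.execute("INSERT INTO Customer VALUES (9, 'Initech', NULL)")  # a customer with no rep yet

# Removing a rep's customers leaves the rep's record intact.
cur.execute("DELETE FROM Customer WHERE RepNum = 55667")
print(cur.execute("SELECT * FROM Rep").fetchall())  # [(55667, 'NewHire')]
```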
Chapter 10 reads like a lesson on
networking. Much of it is outside the scope of this
class, despite the authors' efforts to include the material.
The text makes the point that some network planning decisions should
be made at high levels. It should not be left to branch
offices to decide whether or not to use client/server
networks, whether to use the same software as the rest
of the enterprise, or whether to connect their offices
to each other in the same way that has been used throughout the rest
of the enterprise. Decisions like these should be made for the entire
enterprise, as should decisions on upgrades to such
components.
Much of the material in this chapter should have been considered as
constraints in your early planning and analysis phases.
If your system design did not already account for the network infrastructure
in which it must exist, your plan is in trouble.
It may be appropriate to consider whether you will use a thin or thick
client design. Thin clients put most of the burden on a server,
presenting only an interface on the user's workstation and performing
the real computing in a session on a server. Thick clients (also called
fat clients) put much of the processing on the workstation, and mainly
save data on a server. (A toy sketch of the difference follows the list
below.) Some questions you must ask to make this choice:
- What kind of hardware is already
in place for the workstations and servers?
Does either limit our choices?
- What choices do we have for expanding
the capacity of either workstations or servers? New
computers? Upgrades?
- How robust is our network? Can we expect a thin
client to work in our network, or must processing be done locally?
- Can the network throughput be improved?
- Will this system be run through the Internet? Can
our present web servers meet our needs?
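The toy sketch below (entirely my own invention) contrasts the two
designs. In both, the data lives on a server; the difference is where
the real computing happens.

```python
# Pretend this function runs on a server.
def server_compute_balance(transactions):
    return sum(transactions)

def thin_client(transactions):
    # The workstation only displays what the server computed.
    print("balance:", server_compute_balance(transactions))

def thick_client(transactions):
    # The workstation does the computing itself and would
    # only send stored data to and from the server.
    print("balance:", sum(transactions))

thin_client([100, -40, 25])   # load falls on the server and the network
thick_client([100, -40, 25])  # load falls on the workstation
```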
The text spends several pages on network topology and protocols. This
material is more properly covered in a class about networking.