CS 422 - Database Programming I

Chapter 1, Introduction to Oracle Database

Objectives:

This lesson introduces major components of Oracle. Objectives important to this lesson:

  1. Oracle database architecture
  2. Physical database structures
  3. Logical database structures
  4. Schemas and schema objects
  5. Relational models and normalization
Concepts:

Chapter 1

This text presents a challenge to the practiced reader. Normally you look for the important information in a text which may stand out because of the font or organization of the paragraphs. The author seems to have taken an different approach in the first chapter, where we see that every idea has a new paragraph with only a few sentences, and every paragraph has a new section heading. Too much emphasis?

Let's try to figure out what might be important in chapter 1.

Oracle database architecture

Oracle is a DBMS (Database Management System), also called an RDBMS (Relational DBMS). An Oracle DBMS includes data files (tables), control files, and log files. It is used by two components in the simplest installation: the client and the server. You typically install the server portion on an actual application server on a network, and you install the client software on a workstation. When you use the student version of the product, you will probably install both parts on the same computer. This is fine for learning the product, but not very functional for an enterprise.

In an enterprise environment, we would expect to see the three components listed on page 1-3:

  • client - the user interface where a user authenticates to the system and sends queries to the database
  • application server - a gateway to the database that can perrform user authentication, process some queries, and pass more involved queries to the database server
  • database server - this is the heart of the system that stores and uses database files, indexes, and other database objects.

When the DBMS runs, it is running an instance of the database on the server. Think of an instance as one copy of the program running in memory. An instance implies a copy of the RDBMS, memory allocated to the instance, and access to data files. A powerful server can run multiple instances of the RDBMS at the same time. This is similar to a user running multiple instances of a browser or a word processor on one computer. A server can be:

  • a single instance server - one instance of the RDBMS runs on the server to access the database
  • a multiple instance server - one server runs two or more instances of the RDBMS, each of which access separate data files
  • one of several clustered servers - each server in a cluster runs an instance of the RDBMS, is managed by software on the Cluster Manager server, and has access to a file server (possibly on a Storage Area Network or Network Attached Storage) which has all the files needed by the various instances.

In each case above, the various instances are called into existence by a client making a request to connect to the database. A user typically runs an application on a workstation, which creates a user process, which serves as the user's connection to a server process whose purpose is to support this connection.

Physical database structures

The text discusses five types of physical files that you may find in an Oracle database:

  • Datafiles - files that store data (duh?) in the daatabase; each file can be associated with only one database or tablespace (which the text discusses in two pages, because a tablespace is a logical structure); datafiles typically have the extension .ora or .dbf
  • Control files - a control file contains meta-information about a database, such as the features listed on page 1-4, which include database name, time of creation, and names of associated files; control files typically have the extension .ctl
  • Redo log files - these files are logs of the transactions that have been requested on a database; the logs include committed (saved) transactions and uncommitted (not saved) transactions; redo log files can be stored online or offline; like most transaction files, these can be used when recovering from a system or an instance failure
  • Parameter files - files that contain startup parameters used when starting an instance of the database
  • Alert and trace log files - trace logs store errors that occur during the use of the system; alert log files contain not only error messages,but also events the system associates with the error
Logical database structures
  • Data block - a logical unit of storage; it represents the smallest unit of logical storage; like the concept of storage blocks on a hard drive, but related to Oracle programs. Oracle reads and writes data in blocks that can range in size from 2 KB to 32 KB. This block size can be set by the administrator. Set larger sizes if you have lots of RAM and fast hard drives, set smaller block sizes for slower drives and less RAM
  • Extent - a contiguous group of data blocks. Objects in the database, such as tables, are allocated an "extent" when created. When they grow beyond their initial allocation, they must be allocated another extent to expand into
  • Segment - a n object can have one or more extents, but all of its extents are contained in its segment. Each object will have one segment (except for partitioned tables and indexes, which have one segment per partition)
  • Tablespace - t he collection of all schema objects needed for a database. Database objects must be associated with one tablespace, but a tablespace can contain many database objects; the default tablespace is called SYSTEM

So, Oracle stores data in blocks. There are typically several blocks in an extent. There could be several extents for an object as it grows. All of an object's extents are contained in its segment. Segments usually have a one to one correspondence with schema objects . There could be any number of schema objects in a tablespace. Each schema is owned by a user, and contains all the database objects they have created.

Schemas and schema objects

As noted above, a schema in Oracle relates to the collection of database objects a user has created. Several types are listed in the text:

  • tables - data files, typically organized in rows and columns, in which rows are cases and columns are attributes
  • indexes (indices) - a table may have an associated index file that contains that name of a column, the number of each row in the table, and a number that indicates in what order those rows should be displayed; this method of specifying a display order means that the table need not be sorted again when the index is used
  • views - Oracle defines a view as "A selective presentation of the structure and data of one or more tables. Views can also be based on other views." This is a useful feature when you want to give a user access to part of a database, but not to all of it.

The text considers several features of Oracle through page 1-9. These features are not unique to Oracle, so we do not need to stress them.

Relational models and normalization

The text spends several pages on general database design and normalization, which I hope you have seen before. In case you have not, here are some thoughts about those ideas from another text:

Normalization is the process of making your data tables comply with several rules. Compliance with each successive rule takes the tables to the next level of normalization. The levels are called normal forms, specifically first, second, third, and fourth normal form. The forms can also be referred to by the symbols 1NF, 2NF, 3NF, and 4NF.

The first objective of the chapter is to define functional dependence. A column is functionally dependent on another column if the value found in the second column determines the value found in the first column. Suppose we had a database about customers who live in either Canada or the United States. Let's have a column in the customer table called State/Province, and another column called Country. If the value in State/Province for a customer is Michigan, the value in Country for that customer will be United States. If the value in State/Province is Ontario, the value in Country will be Canada. So the Country column is functionally dependent on State/Province. Likewise, the State/Province column functionally determines Country. Note that the relationship does not have to flow both ways: knowing a customer's country of residence will not determine their state of residence.

In small tables, it is tempting to find dependencies that may not be real. The text suggests that since we have only three sales people in the Rep table, we could conclude that RepNum determines LastName, and LastName determines RepNum. The key is to ask about the future, and to imagine what can and may happen. Hiring a new sales person with the same last name as a current sales rep shows that RepNum will determine LastName, but LastName will not determine any other column.

RepNum, in fact, determines every other field in the Rep table, since each sales person has a unique number. This makes RepNum a key for the Rep table. It is in fact the primary key for this table.

The rule for primary keys states that a column is the primary key if two things are true:

  • All columns are functionally dependent on the primary key
  • No subset of columns in the primary key functionally determines all other columns

This means that a primary key must determine all other columns. It also means that if the primary key consists of more than one column, no portion of the primary key may act as a primary key. If the subset portion of the rule is violated, it means that you have not identified the correct primary key.

Candidate keys are those columns that could be chosen as the primary key. It is possible that more than one candidate key exists in a table, for instance, the book suggests that a table that holds both employee ID numbers and Social Security numbers could use either column as a primary key. Any candidate key that is not chosen to be the primary key is designated an alternate key.

The definition of first normal form says that a table may not contain repeating groups. This was included in earlier chapters as the statement that cells in a table may only contain single values. A standard cure for repeating groups split the records with repeating groups into multiple records.

It may also be necessary to expand the primary key. The second problem with repeating groups is that the column representing the primary key does not determine the value of the field containing the repeating groups. This becomes obvious when you expand such a row, making it multiple rows containing the same value for the primary key. Assume the following table:

Instructor Class Room
Vincent CIS107a
CIS331
S107
S102
Scott CIS302a
CIS331
D204
D204

The Class and Room columns contain repeating values. The Instructor column is meant to be the primary key, but it does not meet the rule. We can fix this by expanding the primary key, adding a column for the second part of the new primary key, and splitting the values of the repeating groups into separate rows.

Instructor Day Class Room
Vincent Wednesday CIS107a S107
Vincent Thursday CIS331 S102
Scott Tuesday CIS302a D204
Scott Wednesday CIS331
D204

In the expanded table, the new primary key is the combination of the Instructor and Day columns. The combination determines the contents of the remaining columns. This table is now in first normal form.

Second normal form says that a table is in first normal form, and it has a new quality: No column that is not part of the primary key is determined by a subset of the primary key. Let's say that another way: if the primary key is made of more than one column, no column in the table can be determined by a mere portion of primary key. In the example above, the rule requires that Class and Room cannot be determined by Instructor or Day alone. This rule is met, so this table is in second normal form.

Note that since this rule describes tables whose primary key is a combination of columns, any table in first normal form that has a single column as its primary key is automatically in second normal form.

So, how do you find and fix tables that are not in 2NF? You examine tables for dependencies. Which columns are determined by which other columns? How much data in the table is repeated from one row to another? You do the same thing when imposing Third normal form, so let's think about it for a minute.

Third normal form lets us use another term from the text. A determinant is a column (or set of columns) that determines the contents of another column. A table is in third normal form if the only determinants in it are candidate keys. (Remember that the primary key is also a candidate key.) The procedure to cure a violation of this rule is simple: we copy the non-candidate determinant into a new table and move the columns that it determines to the new table. The non-candidate determinant becomes the primary key of the new table. It continues to exist in the old table, relating the two tables.

Problems are created when the correction procedures are not done correctly. Students should examine the examples in the text. Note that when the solution is to decompose a table into two or more tables, each resultant table must be in 3NF when you are done.

Consider this example from About.com:

CustNum FirstName LastName Address City State ZIP
1 John Doe 12 Main Street Sea Cliff NY 11579
2 Alan Johnson 82 Evergreen Tr Sea Cliff NY 11579
3 Beth Thompson 1912 NE 1st St Miami FL 33157
4 Jacob Smith 142 Irish Way South Bend IN 46637
5 Sue Ryan 412 NE 1st St Miami FL 33157

Like the example in the text, there is repeated data that could be a problem when updating the database. Assume that we have lots of customers in each of a dozen cities. There is the potential for entering an error each time the same data element is entered. We can simplify data entry by removing the columns that are dependent on a single nonkey column and creating a new table. Since ZIP determines City and State, we remove the City and State columns from the table, and create a new table containing ZIP, City, and State.

ZIP City State
11579 Sea Cliff NY
33157 Miami FL
46637 South Bend IN

This allows us to simplify the original table:

CustNum FirstName LastName Address ZIP
1 John Doe 12 Main Street 11579
2 Alan Johnson 82 Evergreen Tr 11579
3 Beth Thompson 1912 NE 1st St 33157
4 Jacob Smith 142 Irish Way 46637
5 Sue Ryan 412 NE 1st St 33157

The two resultant tables are related on the ZIP column. (Using color theory, I have related the blue table above to the yellow table below on the green column.)

When we place our data tables in 3NF, we avoid some classic problems found in tables that are not in first, second, or third normal form:

  • When data from a table like Sales Reps is represented in other tables by the Sales Rep number, we only have to update other attributes of a Sales Rep in one place, as opposed to many places. This simplifies updating problems.
  • Keeping all data about an entity in one short descriptive table and referencing that table by its key attribute in other tables also avoids inconsistent storage of the data, which is possible when the same value must be entered in many places.
  • Assume we did not have a separate table for sales reps, but stored the Sales Rep data in the Customer table. We could not add a new sales rep to the database without assigning that sales rep to one or more customers. If all the customers of an existing employee were reassigned, we would lose all our data about that sales person. In this respect, having separate tables for customers and sales reps allows us to add and delete from either, and allows for the possibilities of a customer without a rep and a rep without any customers.

The discussion of fourth normal form requires another term. If each value of column A has several associated values of column B, then B is multidependent on A. This is also called a multivalued dependency. Fourth normal form requires that we have third normal form, and no multivalued dependencies.

To avoid the multidependency problem, the text shows us a new method for attaining 1NF. In the first examples, we removed repeating groups and expanded the primary key. That will not work if we have multiple repeating groups. In the text example, we have a Faculty table whose primary key is FacultyNum. It includes FacultyName, and two repeating groups: StudentNum, StudentName, and CommitteeCode, CommitteeName. The idea is that each faculty member will belong to several committees, and will be the adviser for several students. In addition, a student may have several advisers due to multiple majors.

The method to resolve the problems of such a table is to spin off several tables. Solution: remove each repeating group to a new separate table, copy the primary key of the original table to the new table, and establish a new primary key for that table consisting of the copied key column and the key for the repeating group.

This results in simpler tables:
Faculty (FacultyNum, FacultyName)
FacStudent (FacultyNum, StudentNum, StudentName)
FacCommittee (FacultyNum, CommitteeCode, CommitteeName)

Continue with the rest of the chapter and complete the assignment below.

Week 1 Assignment: Chapter 1

  • Read Chapter 1
    • Laugh, if you can.
    • Answer the Review Questions (1-5) that appear on pages 1-19 and 1-20.
  • Read Chapter 2 for next week.