Beginning of Test 2 Material
March 3, 2003
Normalization is an alternative database design tool to data modeling. It was developed as part of theoretical foundation for the relational model application of a series of rules that gradually improve the design.
Data Modeling first emerged as E-R modeling in a
paper by Chen. There were two main concepts
Normalization is a method for increasing the
quality of database design. It is also a theoretical base for defining the
properties of relations. Normalization is useful cross check for ensuring the
soundness of a data model.
Relations can be classified by the types of
modification anomalies to which they are vulnerable. Meaning certain types of
tables can be prone to insertion anomalies. Some tables are prone to deletion
anomalies and others are prone to an update anomaly. Research was done in the
1970’s to classify these types of tables and to find a set of techniques that
would prevent these anomalies. The technique that was developed was
normalization.
Terminology for
Normalization:
One or more attributes determine the value of another attribute.
An identifier functionally determines all the attributes of an entity.
Basically if we are given the value of one attribute or set of attributes, we can look up the value for the other attribute.
It is denoted as
Stock code → firm name, stock price, stock quantity, stock dividend
In other words if we know the stock code we know the value of firm name etc.
An attribute, or set of attributes, that fully functionally determines another attribute is called a determinate. So in referring to stock code above we see that it is a determinant because it fully functionally determines stock PE.
Unlike a key a determinate need not be unique. In other words an attribute can determine another attribute. The example in the book shows that “course type determines fee.”
An identifier (key) can be and should be the determinate, however a determinate is not necessarily a key.
There are situations were a given value (attribute) determines multiple values.
A multi-determines B. Denoted as A →→B.
Example: Department multi-determines course.
There is a one – to – one relationship
A value of an attribute determines the value of another attribute and vice versa
Denoted as A →B and B →A.
When two attributes have a 1:1 relationship, they must occur together in at least one table in a database so that their equivalence is a recorded fact
There is a one to much relationship
A value of an attribute determines the value of another attribute but not vice versa
Denoted as A →B
Example:
As a result, if A and B occur in the same table, then A must be the key
There is a many to many relationship
Denoted as A not →B and B not →A
Example:
French and Flemish is spoken in Belgium and French is spoken in many countries
To record the m:m relationship between these attributes, a table containing both attributes as a composite key is required.
There are five levels of normalization.
The first level says A relation is in first normal
form if and only if all columns are single-valued.
In other words if you have table with single values in the columns the data model is in first normal form.
The second level says A relation is in second normal form if and only if it is in first normal form, and all non key columns are dependent on the key.
According to this definition, if a relation has a single
attribute as its key, then it is automatically in second normal form, by
default, every nonkey attribute is dependent on all of the key. Thus, second
normal form is of concern only in relations that have composite keys.
Slide 11 –
Second normal form is:
CUSTOMER-CREDIT in this case
|
ORDER |
|
|
|
ITEMNO |
CUSTOMERID |
QUANTITY |
CUSTOMER-CREDIT |
|
12 |
57 |
25 |
OK |
|
34 |
679 |
3 |
POOR |
So to fix this problem we would break the ORDER table back to the original table structure of ITEM and CUSTOMER
The third level says A relation is in third normal
form if and only if it is in second normal form and has no transitive
dependencies.
Transitive dependencies….
March 5, 2003
Transitive dependencies finally addressed. A transitive dependency occurs when part of the attributes are determined by the key and then those attributes which are determined by the key are the determinates for another attribute
Example
SID |
Building |
Fee |
|
100 |
Jesse Hall |
3200 |
|
150 |
Graig Hall |
3100 |
|
200 |
Jesse Hall |
3200 |
|
250 |
Jesse Hall |
3200 |
The functional dependency is that Building determines Fee and SID determines building.
This is remedied by splitting the table into two tables.
SID |
Building |
|
100 |
Jesse Hall |
|
150 |
Graig Hall |
|
200 |
Jesse Hall |
|
250 |
Miller Hall |
|
300 |
Jesse Hall |
|
Building |
Fee |
|
Jesse Hall |
3200 |
|
Graig Hall |
3100 |
|
Jesse Hall |
3200 |
|
Miller Hall |
3100 |
|
Jesse Hall |
3200 |
What happens when you have:
You cry and tear your hair …then you deal with it by inventing a stronger version of 3NF (third normal form).
The precise definition is a relation is in Boyce-Codd
normal form if and only if every determinant is a candidate key.
What is a candidate key?
By looking at the table below we see that by combining SID and Major we can come up with Fname. Remember when we use two attributes in combination this is called a composite key. The combination SID and Fname will get us Major. The combinations (SID , Major) and (SID, Fname) are candidate keys. The combination we pick to use will be the primary key.
Example:
This table is in third normal form however it fails the Boyce-Codd test.
|
SID |
Major |
Fname |
|
100 |
Math |
Cauchy |
|
150 |
Psychology |
Jung |
|
200 |
Math |
Riemann |
|
250 |
Math |
Cauchy |
|
300 |
Psychology |
Perls |
|
300 |
Math |
Reimann |
Primary key(s) SID, Major
Key (candidate) SID, Fname
Functional Dependencies Fname determines Major
Solution:
|
SID |
Fname |
|
100 |
Cauchy |
|
150 |
Jung |
|
200 |
Riemann |
|
250 |
Cauchy |
|
300 |
Perls |
|
300 |
Reimann |
|
Fname |
Subject |
|
Cauchy |
Math |
|
Jung |
Psychology |
|
Riemann |
Math |
|
Cauchy |
Math |
|
Perls |
Psychology |
|
Reimann |
Math |
Normalization continued
The precise definition is a relation is in fourth
normal form if it is in BCNF and has no multi-value dependencies.
|
Student |
|
|
|
|
Student ID |
Sport |
Subject |
… |
|
50 |
Football |
English |
… |
|
50 |
Football |
Music |
… |
|
50 |
Tennis |
Botany |
… |
|
50 |
Karate |
Botany |
… |
What is a multi-value dependency?
Refer to the above table. Note that Student ID can participate in many sports so Sport does not functionally depend on Student ID. In addition Student ID can participate in many subjects. In general a multi-value dependency exists when a relation has at least three attributes, two of them are multi-value, and their values depend on the third attribute.
In other words, in a relation R(A,B,C), a multi-value dependency exists if A determines multiple values of B, A determines multiple values of C and B and C are independent of each other.
Multi value dependencies lead to modification anomalies. The solution is to break the one table into two tables.
<![if !supportEmptons, as we have been doing, but then cannot be reconstructed. The condition under which this situation arises has no clear, intuitive meaning. We do not know what the consequences of such dependencies are or even if they have practical consequences…”
The precise definition is a relation is in domain
key/normal form if and only if every constraint on the relation is a logical
consequence of the domain constraints and the key constraints that apply to the
relation.
What are example of constraints?
What is a domain?
Unfortunately, there is no known algorithm for converting a relation to DK/NF, nor is it even known which relations can be converted to DK/NF. Finding, or designing, DK/NF relations is more of an art than a science.
In spite of this, in the practical work of database design, DK/NF is an exceedingly useful design objective. If we can define relations in a way that constraints on them are logical consequences of domains and keys, then there will be no modification anomalies.
Example 1 of Domain/Key Normal Form
STUDENT(SID, GradeLevel, Building, Fee)
Key: SID
Constraints: Building → Fee
SID must not begin with digit 1
We know:
The question is how do we create relations from the above information so that there will be no modification anomalies?
We give the domains for the attributes definitions:
SID in CDDD, where C is a decimal digit not = 1; D = decimal digit
GradeLevel in {FR, SO, JR, SN, GR}
Building in CHAR(4)
Fee in DEC(4)
STUDENT (SID, GradeLevel, Building)
Key: SID
BLDG-FEE (Building, Fee)
Key: Building
The E-R (Entity –Relationship) model
One of the most widely known data modeling methods is the entity-relationship model, developed by Chen. There is no standard for the E-R model, and it has been extended and modified in a number of ways.
Typically the shapes mean the following:
Rectangles represent entities
Diamonds represent relationships with the cardinality encapsulated inside the diamond
1:1 (one to one), 1:N (one to many) and M:N (many to many). One important difference is that a m:m is not shown as an associative entity; thus the database designer must convert this relationship to a table.
Ellipses represent attributes
The ways to represent relationships is varied (see page 222 of text) Some modeling techniques are very refined in their approach and other data modeling mediums have loose definition of relationships. Which ever model is chosen the same goal remains…improving database design. The data model chosen for use in this book was chosen for its simplicity and high quality results.
The Relational Model
So far this semester we have covered data modeling, normalization and E-R diagram modeling of database design. This chapter officially introduces the relational model. We have touched on components of the relational model but now we will bring the components together.
The relational model was developed as a result of recognized short comings of hierarchical and network DBMSs. According to Codd there are three good reasons for adopting the relational mode over other database models
Objectives of relational model research
The relational model has three components:
Domain – The set of values all of the same data type
Relations – a table of n columns and m rows. Each column has a unique name, and all the values in a column are drawn from the same domain. Each row is uniquely identified.
The cardinality of a relation is its number of rows. The degree of a relation is the number of columns. Hence one can expect that the cardinality of a relation will change but the degree shall not. The degree changes if a column is added to a relation, but in terms of relational theory, it is considered to be a new relation.
A relational database collection of relations or tables. Tables are linked by a common tuple that need not have the same name in both tables but it must draw on the same domain.
The primary key is the unique identifier
In some situations there may be several attributes – known
as candidate keys – that are potential primary keys. One of the
candidate keys is chosen as the primary and the candidate keys remaining are
then referred to as alternate keys.
A foreign key is an attribute (possibly composite) of a relation that is also a primary key of a relation. The foreign key and primary key may be in different relations or the same relation, but both keys should be drawn from the same domain.
No component of the primary key of a relation may be
null.
This is the entity integrity rule. This rule ensures that each instance of an entity described in a relation is identifiable in some way. Its implementation means that each row in a relation can be uniquely identified.
In short this enforces the definition of a foreign key. The referential integrity rule says that a foreign key must have a corresponding primary key.
We have a pretty good idea what a relational database consists of. We have been working with them all semester. According to our text there are four ways to manipulate relational tables.
We are fairly familiar with SQL and QBE now we will take a closer look at relational algebra and relational calculus.
While relational algebra can be used to resolve queries, it is seldom employed because it requires you to specify what you want and how to get it. This makes relational algebra more difficult to use than SQL.
Relational calculus is a non-procedural language, there for it is less concerned with how you get information. While relational calculus over comes some of the shortcomings of relational algebra it is harder to learn. As a result language designers developed SQL and QBE.
Relational algebra is the standard for judging data retrieval languages. If a language, such as SQL, can be used to express every relational algebra operator, it is said to be relationally complete,
Restrict - Creates a new table from specified rows of an existing table
Project- Creates a new table from specified columns of an existing table
Product – Creates a new table from all of the existing rows of two existing rows. If Table A has 2 columns and 3 rows; Table B has 3 columns and two rows. The product of A and B will have 5 columns and six rows.
Union – Creates a new table containing rows appearing in one or both tables of two existing tables. The relations must be union compatible, which means that they have the same column names, in the same order, and drawn on the same domain. Union is the sum of the rows in both tables. If it is in Table A or Table B it will be in Table C less the duplications.
Intersect – Creates a new table containing rows
appearing in both tables of two existing tables. If it is in Table A and it is in Table B it will be in
the resultant table. The relations must be union compatible.
Difference – Creates a new table containing rows appearing in one table but not in the other of two existing tables. The difference of two relations is a new relation containing all rows appearing in the first relation but not the second.
Join – creates a new relation from two relations for
all combinations of rows satisfying the join condition. Technically a join is a
product of two tables followed by a restrict.
Divide – Division requires that A and B have a set of attributes that are common to both relations.
The full set of relational operators is not required. Join, Divide and Intersect can be described and done by using combinations of the other operators. Hence there are five primitive relational algebra operators.
Comparison of relational algebra primitive operations and SQL:.
Operation |
Relational algebra |
SQL |
|
Restrict |
A where condition |
SELECT * FROM A WHERE condition |
|
Project |
A[X] |
SELECT X FROM A |
|
Product |
A times B |
SELECT * FROM A, B |
|
Union |
A union B |
SELECT * FROM A UNION SELECT * FROM B |
|
Difference |
A minus B |
SELECT * FROM A WHERE NOT EXISTS (SELECT * FROM B WHERE A.X = B.X AND A.Y = B.Y AND … |
A relational database has structures, integrity rules and a manipulation language. A fully relational database system provides complete support for each of these components. Many commercial systems support SQL, but do not provide support for domains or integrity rules. Such systems are not fully relational but are relationally complete
The proponents of OO maintain that ODBMS offers several advantages over current implementation of relational database technology. They assert that the shortcomings of the relational model are
It should be noted that there is considerable disagreement between representatives of OO and relational adherents. This chapter tries to present a balance of the view points.
Unified Modeling Language was introduced in 1997 by the Object Management Group. UML has become the common modeling language for OO software development. UML standardizes the notation for describing processes and has nine predefined diagrams.
We will focus on the class diagram because of its correspondence to the data model.
E-business developers are using UML to meet customer demands.
Traditional relational database structure in not designed to handle complex data structures such as
Object Oriented Programming Languages (OOPL) have been in the making for about 40 years. Smalltalk was developed by Xerox PARC in the 1970’s and introduced the terms object and OO. As you know OO programming has become the standard in many development arenas these days.
Key OO concepts
OO applications are created by assembling and using objects – self contained units that can contain data – very much like entities in data modeling. An object is something in the real world. Objects also contain the necessary instructions, the algorithm to transform data. Objects are application development building blocks.
OO modeling and data modeling have a common objective of creating a representation of a real-world information system. Consequently, they frequently deal with the same concepts but use different terminology.
|
Data Modeling Term |
OO Modeling Term |
|
Entity |
Object class or classes |
|
Instance |
Object |
|
Attribute |
Attribute |
Data Abstraction is the process of creating an abstract model to enhance understanding of reality. An organizational chart shows the associations between roles in organization. A diagram of an investment portfolio is another example of data abstraction.
Encapsulation means that all processing that changes the state of an object (changes values) is done within that object. Encapsulation implies that an object is shielded from interference from other objects; that is, and object cannot directly change any other objects. Encapsulation means that data and methods can be packaged together.
Objects communicate with each other by sending and receiving messages. The receiving object needs a method for handling the message. A message can trigger a change in a receiving object.
An important feature
of OO is that classes in a generalization/specialization hierarchy will inherit
data and methods from the superclass.
Don’t reinvent the
wheel. If a class with similar characteristics is in existence …reuse it. But
be careful. Watch out for kangaroos.
Why use 00?
The slow growth in
programmer productivity is exacerbated by the massive effort required to keep
operational software systems current. Some companies spend as much as 80
percent of their potential development resources on maintenance of old systems.
OO software in
constructed by creating self contained object classes. Because each object is
relative independent there is less ripple effect, which often happens with
traditional programming methods. The code is easier to understand even if
documentation is sparse and the maintenance programmer has little knowledge of
the application.
Gathering client
information is difficult at best. OO advocates say that the OO modeling
techniques is more in tune to the way clients view their world.
Adding features is
deemed to be easier with OO methodology due to the independence of the existing
structure. Thus OO is suitable for evolutionary development in which the main
features of the system are first developed, and less critical elements added
later.
A model by
definition is an abstraction of something else. When modeling, we omit some of
the features or detail we find in the real world and incorporate only those we
find useful to the client’s needs.
A city map is a
model of some aspects of a city’s streets. A model should reduce the
communication gap between the client and the analyst. OO supporters assert that dealing with systems objects, which
closely resemble real-life objects is more natural than dealing with entities,
relationships and data flows.
The procedural model describes what the system needs to do and how
to do it. Often referred to as the dynamic part of the of an information systems
model, a procedural model has two parts.
The static aspects
are modeled in the class diagram, which shows the potential or the capabilities
of the system as opposed to what the system will actually do in a given
circumstance.
A simple banking
system is good example.
The basic modeling
concepts learned earlier apply to the OO modeling of the static part of the
system. Like entities, object classes can be related to other object classes.
Two broad types of
possible relationships are depicted in an OO model:
Finding objects and
classes is very similar to identifying entities. Which classes to include in a
system is highly dependent on the context, the client, and the modeler.
Although there is no correct set of classes the OO analyst should strive to
create a high fidelity model of the client’s world.
One approach is to
underline any nouns in the problem description. Most nouns are possible
classes, and underlining ensures that no potential classes are overlooked.
We will use a
four-layer model of OO analysis
Although presented
sequentially, actual analysis will usually not strictly follow such a pattern.
Customers can rent
diving equipment and boats from DDT. When a customer has seen what is available
and decided what to rent, a rental agreement or contract is produced and signed.
Underline the nouns
and we see that Customer, Diving equipment, Contract and Boat are prime
candidates for classes
There are two types
of structures:
Generalization
Depending on how and
with which classes we start, OO modeling can be top down or bottom up. If we
look at BOAT and DIVING EQUIPMENT first and realize that they can both be part
of a generalized class which we call RENTAL EQUIPMENT we have done bottom up
modeling. If we started with RENTAL EQUIPMENT we have done top-down modeling.
RENTAL EQUIPMENT
does not contain any objects, which is why it is an abstract class. It serves
solely as a vehicle for specifying the common attributes and methods of DIVING
EQUIPMENT and BOAT.
As we look at this
situation a bit closer we realize that more detail may be required. DIVING
EQUIPMENT would need specialized information about some of this equipment. For
example suits are described by size, thickness and type. SUIT would be a
subclass to DIVING EQUIPMENT.
An inheritance
relation is between classes, but inheritance is not like a relationship in data
modeling. The inheritance relationship between SUIT and DIVING EQUIPMENT does
not imply that a SUIT object (one specific suit) is linked to a corresponding
DIVING EQUIPMENT object. Here,
inheritance means that SUIT has the same methods and attributes and DIVING
EQUIPMENT and some other methods and attributes that are unique to SUIT.
So far,
generalization structures have been simple tree structures. Each subclass has
one superclass from which it inherits methods and attributes. Multiple
inheritance means that a class can inherit attributes and methods from more
than one superclass. The generalization structure becomes more like a lattice
than a tree.
As you may
anticipate, multiple inheritance can create problems when an application is
executed. There may be conflicts between the inherited methods and attributes.
Although there are many strategies for dealing with this problem the issue
remains unresolved. One strategy is not to support multiple inheritance, which
is the approach taken by some OOPL’s. Java does not support multiple
inheritance.
An aggregation shows the whole assembly and its parts. See page 396 of your text for an example. The relationship is sometimes call a whole/part structure because a TRAILER is a part of BOAT ASSEMBLY. It is also sometimes called an owned relationship. Note the cardinality.
The composition relationshipis distinguished from the other types of relationships by open or filled diamonds. In this case, we have a composite aggregation between boat assembly and the boat’s hull; they are inseparable. The other two aggregations are shared because a boat assembly might not always have a trailer or motor.
An OO model, like a data model records details of the aggregation relationship. In the case of an OO model the relationship is between objects.
Read the aggregation model as follows: (figure 13-10)
We will combine pieces to illustrate the current state of our OO model. Note this model is incomplete because it does not include CUSTOMER or CONTRACT.
We have covered the Class and object layer, and the structure layer. Now we will discuss the attribute layer.
The attribute layer has two purposes: specification of attributes and specification of association relationships. Specification is similar to data modeling. Some OO proponents contend that you need not define attributes until physical design or implementation. The author recommends that you specify attributes as you discover them. How much detail you include seems to depend on what you are modeling.
Sometimes called reference relationships, association relationships depict links between occurrences of an object. Also called instance connections, they are conceptually the same as relationships in data modeling.
The aggregation relationship is actually a special version of the association relationship. By using both aggregation and association modeling we come up with a model that is closer to depicting reality.
The required processing for each object is described in the service layer. Simple and complex services are distinguished. Simple services are the standard services that most objects need to be able to perform. These are usually implied as always being available and are not shown in the model. Simple services typically include create, connect, access and release.
Complex or nonstandard services, which are shown explicitly, fall into two categories: calculate and monitor. For example, the object class CONTRACT needs to calculate a rental fee. This nonstandard service is shown in the services window.
Mapping OO model to a relational database…here are some alternatives:
Persistent Objects
In OO programming,
objects only exist for the duration of a program’s execution. If we use OO for
information systems, we need to be able to store details of persistent objects,
those that last much longer. We need an ODBMS which stores simple and complex
objects. It also supports the creation of abstract data types, encapsulation
and inheritance.
Below is a useful URL which has journals associated with what we will cover this chapter.
http://lamp.infosys.deakin.edu.au/journals/index.php?page=alpha&alpha=D
Organizational
intelligence is the outcome of an
organization’s effort to collect, store, process, and interpret data from
internal and external sources. The conclusions or clues gleaned from an
organization’s data stores enable it to identify problems or opportunities,
which is the first stage of decision making.
Transaction
processing systems (TSPs) are a
core component of organizational memory and thus an important source of data.
Along with relevant external information,
the various TSPs are the bedrock of an organizational intelligence
system. TSPs are creating a massive torrent of data that potentially reveals to
an organization a great deal about its business and its customers.
Unfortunately, many
organizations are unable to exploit, either effectively or efficiently, the massive amount of data generated by TSPs
for the following reasons:
In other words
organizational memory is highly fragmented. A technology is needed that can
accumulate a considerable proportion or organizational memory into one readily
accessible system. In order to improve organizational performance, increase
revenues and provide better customer service decision makers need to have facts
readily available. An organization first needs to organize these data into one
logical collection and then use software to sift through this collection to
extract meaning.
A data warehouse is a subject oriented, integrated, time variant, and nonvolatile set of
data that supports decision making.
Associated with data
warehouse are data management aids, data extraction, analysis tools and
applications.
A data warehouse is
a snapshot of an organizatio, stored in operational data stores (ODS), are the raw material of a data warehouse.
Data must be extracted from many systems with different names and in different
formats however some data may be the same from these different systems. At the
time of operational system development there was no plan to ever merge the data
from the systems. As a result the extraction of data from many different
systems is time consuming and complex. Further extraction must be an on going
process so that analysts can work with current data.
Transformation is
part of the data extraction process. In the warehouse, data must be standardized
and follow consistent coding systems. There are several types of
transformation: