Beginning of Test 2 Material

 

March 3, 2003

 

Normalization is an alternative database design tool to data modeling. It was developed as part of theoretical foundation for the relational model application of a series of rules that gradually improve the design. 

 

Data Modeling first emerged as E-R modeling in a paper by Chen. There were two main concepts

 

Normalization is a method for increasing the quality of database design. It is also a theoretical base for defining the properties of relations. Normalization is useful cross check for ensuring the soundness of a data model.

 

Relations can be classified by the types of modification anomalies to which they are vulnerable. Meaning certain types of tables can be prone to insertion anomalies. Some tables are prone to deletion anomalies and others are prone to an update anomaly. Research was done in the 1970’s to classify these types of tables and to find a set of techniques that would prevent these anomalies. The technique that was developed was normalization.

 

Terminology for Normalization:

 

Functional Dependency

 

One or more attributes determine the value of another attribute.

An identifier functionally determines all the attributes of an entity.

 

Basically if we are given the value of one attribute or set of attributes, we can look up the value for the other attribute.

 

It is denoted as

 

Stock code → firm name, stock price, stock quantity, stock dividend

In other words if we know the stock code we know the value of firm name etc.

 

An attribute, or set of attributes, that fully functionally determines another attribute is called a determinate. So in referring to stock code above we see that it is a determinant because it fully functionally determines stock PE.

 

Unlike a key a determinate need not be unique. In other words an attribute can determine another attribute. The example in the book shows that “course type determines fee.”

 

An identifier (key) can be and should be the determinate, however a determinate is not necessarily a key.

 

There are situations were a given value (attribute) determines multiple values.

 

A multi-determines B. Denoted as A →→B.

 

Example: Department multi-determines course.

 

Attribute Relationships

 

There is a one – to – one relationship

A value of an attribute determines the value of another attribute and vice versa

Denoted as  A →B and B →A.

 

When two attributes have a 1:1 relationship, they must occur together in at least one table in a database so that their equivalence is a recorded fact

 

There is a one to much relationship

A value of an attribute determines the value of another attribute but not vice versa

Denoted as  A →B

 

Example:

 

As a result,  if A and B occur in the same table, then A must be the key

 

There is a many to many relationship

Denoted as A not →B and B not →A

 

Example:

 

French and Flemish is spoken in Belgium and French is spoken in many countries

To record the m:m relationship between these attributes,  a table containing both attributes as a composite key is required.

 

Normal Forms

There are five levels of normalization.

 

The first level says A relation is in first normal form if and only if all columns are single-valued.

 

In other words if you have table with single values in the columns the data model is in first normal form.

 

 

The second level says A relation is in second normal form if and only if it is in first normal form, and all non key columns are dependent on the key.

 

According to this definition, if a relation has a single attribute as its key, then it is automatically in second normal form, by default, every nonkey attribute is dependent on all of the key. Thus, second normal form is of concern only in relations that have composite keys.

 

Slide 11 –

Second normal form is:

CUSTOMER-CREDIT in this case

 

ORDER

 

 

 

ITEMNO

CUSTOMERID

QUANTITY

CUSTOMER-CREDIT

12

57

25

OK

34

679

3

POOR

 

So to fix this problem we would break the ORDER table back to the original table structure of ITEM and CUSTOMER

 

The third level says A relation is in third normal form if and only if it is in second normal form and has no transitive dependencies.

 

Transitive dependencies….

 

March 5, 2003

 

Transitive dependencies finally addressed. A transitive dependency occurs when part of the attributes are determined by the key and then those attributes which are determined by the key are the determinates for another attribute

 

Example

 

SID

Building

Fee

100

Jesse Hall

3200

150

Graig Hall

3100

200

Jesse Hall

3200

250

Jesse Hall

3200

 

The functional dependency is that Building determines Fee and SID determines building.

This is remedied by splitting the table into two tables.

 

SID

Building

100

Jesse Hall

150

Graig Hall

200

Jesse Hall

250

Miller Hall

300

Jesse Hall

 

Building

Fee

Jesse Hall

3200

Graig Hall

3100

Jesse Hall

3200

Miller Hall

3100

Jesse Hall

3200

 

 

What happens when you have:

 

You cry and tear your hair …then you deal with it by inventing  a stronger version of 3NF (third normal form).

 

The precise definition is a relation is in Boyce-Codd normal form if and only if every determinant is a candidate key.

 

What is a candidate key?

By looking at the table below we see that by combining SID and Major we can come up with Fname. Remember when we use two attributes in combination this is called a composite key. The combination SID and Fname will get us Major. The combinations (SID , Major) and (SID, Fname) are candidate keys. The combination we pick to use will be the primary key.

 

Example:

This table is in third normal form however it fails the Boyce-Codd test.

 

SID

Major

Fname

100

Math

Cauchy

150

Psychology

Jung

200

Math

Riemann

250

Math

Cauchy

300

Psychology

Perls

300

Math

Reimann

 

Primary key(s) SID, Major

Key (candidate) SID, Fname

 

Functional Dependencies Fname determines Major

 

Solution:

 

SID

Fname

100

Cauchy

150

Jung

200

Riemann

250

Cauchy

300

Perls

300

Reimann

 

Fname

Subject

Cauchy

Math

Jung

Psychology

Riemann

Math

Cauchy

Math

Perls

Psychology

Reimann

Math

 

 

March 7, 2003

 

Normalization continued

The precise definition is a relation is in fourth normal form if it is in BCNF and has no multi-value dependencies.

 

Violation of fourth normal form

Student

 

 

 

Student ID

Sport

Subject

50

Football

English

50

Football

Music

50

Tennis

Botany

50

Karate

Botany

 

What is a multi-value dependency?

 

Refer to the above table. Note that Student ID can participate in many sports so Sport does not functionally depend on Student ID. In addition Student ID can participate in many subjects. In general a multi-value dependency exists when a relation has at least three attributes, two of them are multi-value,  and their values depend on the third attribute.

 

In other words, in a relation R(A,B,C), a multi-value dependency exists if A determines multiple values of B,  A determines multiple values of C and B and C are independent of each other.

 

Multi value dependencies lead to modification anomalies. The solution is to break the one table into two tables.

 

Fifth Normal Form

<![if !supportEmptons,  as we have been doing,  but then cannot be reconstructed. The condition under which this situation arises has no clear, intuitive meaning.  We do not know what the consequences of such dependencies are or even if they have practical consequences…”

 

March 10, 2003

 

Domain/Key Normal Form

 

The precise definition is a relation is in domain key/normal form if and only if every constraint on the relation is a logical consequence of the domain constraints and the key constraints that apply to the relation.

 

What are example of constraints?

 

What is a domain?

 

Unfortunately, there is no known algorithm for converting a relation to DK/NF, nor is it even known which relations can be converted to DK/NF. Finding, or designing, DK/NF relations is more of an art than a science.

 

In spite of this, in the practical work of database design, DK/NF is an exceedingly useful design objective. If we can define relations in a way that constraints on them are logical consequences of domains and keys, then there will be no modification anomalies.

 

Example 1 of Domain/Key Normal Form

 

STUDENT(SID, GradeLevel, Building, Fee)

            Key: SID

            Constraints: Building → Fee

                               SID must not begin with digit 1

 

We know:

  1. SID functionally determines the other three attributes so SID is a key
  2. We know that Building determines Fee.
  3. SID must not begin with one.

 

The question is how do we create relations from the above information so that there will be no modification anomalies?

 

We give the domains for the attributes definitions:

 

SID                    in    CDDD, where C is a decimal digit not = 1; D = decimal digit

GradeLevel        in {FR, SO, JR, SN, GR}

Building             in CHAR(4)

Fee                   in DEC(4)

 

We define relations and keys

 

STUDENT (SID, GradeLevel, Building)

Key: SID

 

BLDG-FEE (Building, Fee)

Key: Building

 

Other data modeling methods

 

The E-R (Entity –Relationship) model

 

One of the most widely known data modeling methods is the entity-relationship model, developed by Chen.  There is no standard for the E-R model, and it has been extended and modified in a number of ways.

 

Typically the shapes mean the following:

 

Rectangles represent entities

Diamonds represent relationships with the cardinality encapsulated inside the diamond

1:1 (one to one), 1:N (one to many) and M:N (many to many). One important difference is that a m:m is not shown as an associative entity; thus the database designer must convert this relationship to a table.

Ellipses represent attributes

 

Representing Relationships

The ways to represent relationships is varied (see page 222 of text) Some modeling techniques are very refined in their approach and other data modeling mediums have loose definition of relationships. Which ever model is chosen the same goal remains…improving database design. The data model chosen for use in this book was chosen for its simplicity and high quality results.

 

Chapter 9

 

The Relational Model

 

So far this semester we have covered data modeling, normalization and E-R diagram modeling of database design. This chapter officially introduces the relational model. We have touched on components of the relational model but now we will bring the components together.

 

The relational model was developed as a result of recognized short comings of hierarchical and network DBMSs. According to Codd there are three good reasons for adopting the relational mode over other database models

 

  1. Other models force the programmer to code at a level of structural detail. So application  programs are more complex and take longer to code and debug
  2. No commands are provided for processing multiple records at a time
  3. Only the relational model through a query language such as SQL, recognizes the client’s need for ad hoc queries.

 

Objectives of relational model research

 

  1. Create a clearly delineated boundary between the logical and physical aspects of database management. This is known as the data independence objective
  2. Create a simple model that was readily understood by a wide range of users and programmers. This is the communicability objective
  3. Increase processing capabilities from a record-at-a-time to multiple records at a time. This is known as the set processing objective.

 

The relational model has three components:

  1. data structures
  2. integrity rules
  3. operators used to retrieve, derive, or modify data

 

Data Structures

 

Domain – The set of values all of the same data type

Relations – a table of n columns and m rows. Each column has a unique name, and all the values in a column are drawn from the same domain. Each row is uniquely identified.

 

The cardinality of a relation is its number of rows.  The degree of a relation is the number of columns. Hence one can expect that the cardinality of a relation will change but the degree shall not. The degree changes if a column is added to a relation, but in terms of relational theory, it is considered to be a new relation.

 

A relational database collection of relations or tables. Tables are linked by a common tuple that need not have the same name in both tables but it must draw on the same domain.

 

The primary key is the unique identifier

 

In some situations there may be several attributes – known as candidate keys – that are potential primary keys. One of the candidate keys is chosen as the primary and the candidate keys remaining are then referred to as alternate keys.

 

A foreign key is an attribute (possibly composite) of a relation that is also a primary key of a relation. The foreign key and primary key may be in different relations or the same relation, but both keys should be drawn from the same domain.

 

Integrity Rules

 

No component of the primary key of a relation may be null.

 

This is the entity integrity rule. This rule ensures that each instance of an entity described in a relation is identifiable in some way. Its implementation means that each row in a relation can be uniquely identified.

 

A database must not contain any unmatched foreign key values

 

In short this enforces the definition of a foreign key. The referential integrity rule says that a foreign key must have a corresponding primary key.

 

We have a pretty good idea what a relational database consists of. We have been working with them all semester. According  to our text there are four ways to manipulate relational tables.

 

  1. SQL
  2. QBE
  3. Relational Algebra
  4. Relational Calculus

 

We are fairly familiar with SQL and QBE now we will take a closer look at relational algebra and relational calculus.

 

While relational algebra can be used to resolve queries, it is seldom employed because it requires you to specify what you want and how to get it. This makes relational algebra more difficult to use than SQL.

 

Relational calculus is a non-procedural language, there for it is less concerned with how you get information. While relational calculus over comes some of the shortcomings of relational algebra it is harder to learn. As a result language designers developed SQL and QBE.

 

Relational algebra is the standard for judging data retrieval languages. If a language, such as SQL,  can be used to express every relational algebra operator, it is said to be relationally complete,

 

Relational Algebra Operators

 

Restrict - Creates a new table from specified rows of an existing table

 

Project- Creates a new table from specified columns of an existing table

 

Product – Creates a new table from  all of the existing rows of two existing rows. If Table A has 2 columns and 3 rows; Table B has 3 columns and two rows. The product of A and B will have 5 columns and six rows.

 

Union – Creates a new table containing rows appearing in one or both tables of two existing tables. The relations must be union compatible, which means that they have the same column names,  in the same order, and drawn on the same domain. Union is the sum of the rows in both tables. If it is in Table A or Table B it will be in Table C less the duplications.

 

Intersect – Creates a new table containing rows appearing in both tables of two existing tables. If it is in Table A  and it is in Table B it will be in the resultant table. The relations must be union compatible.

 

Difference – Creates a new table containing rows appearing in one table but not in the other of two existing tables. The difference of two relations is a new relation containing all rows appearing in the first relation but not the second.

 

Join – creates a new relation from two relations for all combinations of rows satisfying the join condition. Technically a join is a product of two tables followed by a restrict.

 

Divide – Division requires that A and B have a set of attributes that are common to both relations.

 

March 12, 2003

 

The full set of relational operators is not required. Join, Divide and Intersect can be described and done by using combinations of the other operators. Hence there are five primitive relational algebra operators.

 

Comparison of relational algebra primitive operations and SQL:.

Operation

Relational algebra

SQL

Restrict

A where condition

SELECT * FROM A WHERE condition

Project

A[X]

SELECT X FROM A

Product

A times B

SELECT * FROM A, B

Union

A union B

SELECT * FROM A UNION SELECT * FROM B

Difference

A minus B

SELECT * FROM A

    WHERE NOT EXISTS

     (SELECT * FROM B WHERE

         A.X = B.X  AND  A.Y = B.Y AND …

 

 

A relational database has structures, integrity rules and a manipulation language. A fully relational database system provides complete support for each of these components. Many commercial systems support SQL, but do not provide support for domains or integrity rules. Such systems are not fully relational but are relationally complete

 

Chapter 13 – Object Oriented Data Management

 

The proponents of OO maintain that ODBMS offers several advantages over current implementation of relational database technology. They assert that the shortcomings of the relational model are

 

  1. It cannot handle complex objects such as multimedia;
  2. There is no support for general data types found in some programming languages;
  3. Performance degrades when large numbers of tables must be joined to respond to a query;
  4. There is a mismatch between the relational data model’s set at a time processing and the record-at- time processing of programming languages;
  5. There is no support for representing and recording change, such as different versions of objects.

 

It should be noted that there is considerable disagreement between representatives of OO and relational adherents. This chapter tries to present a balance of the view points.

 

UML

Unified Modeling Language was introduced in 1997 by the Object Management Group. UML has become the common modeling language for OO software development. UML standardizes the notation for describing processes and has nine predefined diagrams.

 

We will focus on the class diagram because of its correspondence to the data model.

 

E-business developers are using UML to meet customer demands.

 

Traditional relational database structure in not designed to handle complex data structures such as

 

 

Object Oriented Programming Languages (OOPL) have been in the making for about 40 years. Smalltalk was developed by Xerox PARC in the 1970’s and introduced the terms object and OO. As you know OO programming has become the standard in many development arenas these days.

 

Key OO concepts

 

OO applications are created by assembling and using objects – self contained units that can contain data – very much like entities in data modeling. An object is something in the real world. Objects also contain the necessary instructions, the algorithm to transform data. Objects are application development building blocks.

 

OO modeling and data modeling have a common objective of creating a representation of a real-world information system. Consequently, they frequently deal with the same concepts but use different terminology.

 

Data Modeling Term

OO Modeling Term

Entity

Object class or classes

Instance

Object

Attribute

Attribute

 

Data Abstraction is the process of creating an abstract model to enhance understanding of reality. An organizational chart shows the associations between roles in organization. A diagram of an investment portfolio is another example of data abstraction.

 

Encapsulation means that all processing that changes the state of an object (changes values) is done within that object. Encapsulation implies that an object is shielded from interference from other objects; that is, and object cannot directly change any other objects. Encapsulation means that data and methods can be packaged together.

 

Message passing

Objects communicate with each other by sending and receiving messages. The receiving object needs a method for handling the message. A message can trigger a change in a receiving object.

 

Inheritance

An important feature of OO is that classes in a generalization/specialization hierarchy will inherit data and methods from the superclass.

 

Reuse

Don’t reinvent the wheel. If a class with similar characteristics is in existence …reuse it. But be careful. Watch out for kangaroos.

 

Why use 00?

 

 

March 14, 2003

 

The slow growth in programmer productivity is exacerbated by the massive effort required to keep operational software systems current. Some companies spend as much as 80 percent of their potential development resources on maintenance of old systems.

 

    1. Programming is a difficult, intellectual task and prone to logic errors.
    2. Determining the client’s true needs is a major challenge
    3. Additional features may be needed
    4. The business environment changes
    5. Hardware changes

 

OO software in constructed by creating self contained object classes. Because each object is relative independent there is less ripple effect, which often happens with traditional programming methods. The code is easier to understand even if documentation is sparse and the maintenance programmer has little knowledge of the application.

 

Gathering client information is difficult at best. OO advocates say that the OO modeling techniques is more in tune to the way clients view their world.

 

Adding features is deemed to be easier with OO methodology due to the independence of the existing structure. Thus OO is suitable for evolutionary development in which the main features of the system are first developed, and less critical elements added later.

 

A model by definition is an abstraction of something else. When modeling, we omit some of the features or detail we find in the real world and incorporate only those we find useful to the client’s needs.

 

A city map is a model of some aspects of a city’s streets. A model should reduce the communication gap between the client and the analyst.  OO supporters assert that dealing with systems objects, which closely resemble real-life objects is more natural than dealing with entities, relationships and data flows.

 

Types of models

 

The procedural model describes what the system needs to do and how to do it. Often referred to as the dynamic part of the of an information systems model, a procedural model has two parts.

 

The static aspects are modeled in the class diagram, which shows the potential or the capabilities of the system as opposed to what the system will actually do in a given circumstance.

 

Scenarios

 

A simple banking system is good example.

 

Static OO modeling

 

The basic modeling concepts learned earlier apply to the OO modeling of the static part of the system. Like entities, object classes can be related to other object classes.

 

Two broad types of possible relationships are depicted in an OO model:

 

Finding Objects and Classes

 

Finding objects and classes is very similar to identifying entities. Which classes to include in a system is highly dependent on the context, the client, and the modeler. Although there is no correct set of classes the OO analyst should strive to create a high fidelity model of the client’s world.

 

One approach is to underline any nouns in the problem description. Most nouns are possible classes, and underlining ensures that no potential classes are overlooked.

 

We will use a four-layer model of OO analysis

 

 

Although presented sequentially, actual analysis will usually not strictly follow such a pattern.

 

Example – Class and Object layer

 

Customers can rent diving equipment and boats from DDT. When a customer has seen what is available and decided what to rent, a rental agreement or contract  is produced and signed.

 

Underline the nouns and we see that Customer, Diving equipment, Contract and Boat are prime candidates for classes

 

Structure Layer

 

There are two types of structures:

 

Generalization

 

Depending on how and with which classes we start, OO modeling can be top down or bottom up. If we look at BOAT and DIVING EQUIPMENT first and realize that they can both be part of a generalized class which we call RENTAL EQUIPMENT we have done bottom up modeling. If we started with RENTAL EQUIPMENT we have done top-down modeling.

 

RENTAL EQUIPMENT does not contain any objects, which is why it is an abstract class. It serves solely as a vehicle for specifying the common attributes and methods of DIVING EQUIPMENT and BOAT.

 

As we look at this situation a bit closer we realize that more detail may be required. DIVING EQUIPMENT would need specialized information about some of this equipment. For example suits are described by size, thickness and type. SUIT would be a subclass to DIVING EQUIPMENT.

 

An inheritance relation is between classes, but inheritance is not like a relationship in data modeling. The inheritance relationship between SUIT and DIVING EQUIPMENT does not imply that a SUIT object (one specific suit) is linked to a corresponding DIVING EQUIPMENT object.  Here, inheritance means that SUIT has the same methods and attributes and DIVING EQUIPMENT and some other methods and attributes that are unique to SUIT.

 

Multiple inheritance

 

So far, generalization structures have been simple tree structures. Each subclass has one superclass from which it inherits methods and attributes. Multiple inheritance means that a class can inherit attributes and methods from more than one superclass. The generalization structure becomes more like a lattice than a tree.

 

As you may anticipate, multiple inheritance can create problems when an application is executed. There may be conflicts between the inherited methods and attributes. Although there are many strategies for dealing with this problem the issue remains unresolved. One strategy is not to support multiple inheritance, which is the approach taken by some OOPL’s. Java does not support multiple inheritance.

 

March 17, 2003

 

Aggregation

 

An aggregation shows the whole assembly and its parts. See page 396 of your text for an example. The relationship is sometimes call a whole/part structure because a TRAILER is a part of  BOAT ASSEMBLY.  It is also sometimes called an owned relationship. Note the cardinality.

 

The composition relationshipis distinguished from the other types of relationships by open or filled diamonds. In this case, we have a composite aggregation between boat assembly and the boat’s hull; they are inseparable. The other two aggregations are shared because a boat assembly might not always have a trailer or motor.

 

An OO model, like a data model records details of the aggregation relationship. In the case of an OO model the relationship is between objects.

 

Read the aggregation model as follows: (figure 13-10)

  1. A boat has zero or one trailers,  and a trailer belongs to one boat;
  2. A boat has zero, one , or two motors, and a motor belongs to one boat;
  3. A boat has one hull,  and one hull belongs to one boat.

 

We will combine pieces to illustrate the current state of our OO model. Note this model is incomplete because it does not include CUSTOMER or CONTRACT.

 

We have covered the Class and object layer, and the structure layer. Now we will discuss the attribute layer.

 

The attribute layer has two purposes: specification of attributes and specification of association relationships. Specification is similar to data modeling. Some OO proponents contend that you need not define attributes until physical design or implementation. The author recommends that you specify attributes as you discover them. How much detail you include seems to depend on what you are modeling.

 

Association Relationships

 

Sometimes called reference relationships, association relationships depict links between occurrences of an object.  Also called instance connections, they are conceptually the same as relationships in data modeling.

 

The aggregation relationship is actually a special version of the association relationship. By using both aggregation and association modeling we come up with a model that is closer to depicting reality.

 

Service layer

 

The required processing for each object is described in the service layer. Simple and complex services are distinguished. Simple services are the standard services that most objects need to be able to perform. These are usually implied as always being available and are not shown in the model. Simple services typically include create, connect, access and release.

 

Complex or nonstandard services, which are shown explicitly,  fall into two categories: calculate and monitor. For example, the object class CONTRACT needs to calculate a rental fee. This nonstandard service is shown in the services window.

 

Comparison of OO Modeling and Relational Paradigms

 

Mapping OO model to a relational database…here are some alternatives:

 

 

Persistent Objects

 

In OO programming, objects only exist for the duration of a program’s execution. If we use OO for information systems, we need to be able to store details of persistent objects, those that last much longer. We need an ODBMS which stores simple and complex objects. It also supports the creation of abstract data types, encapsulation and inheritance.

 

Fundamental Differences between ODBMS and RDBMS Technology

 

March 19, 2003

 

Below is a useful URL which has journals associated with what we will cover this chapter.

http://lamp.infosys.deakin.edu.au/journals/index.php?page=alpha&alpha=D

 

 

Organizational intelligence is the outcome of an organization’s effort to collect, store, process, and interpret data from internal and external sources. The conclusions or clues gleaned from an organization’s data stores enable it to identify problems or opportunities, which is the first stage of decision making.

 

Transaction processing systems (TSPs) are a core component of organizational memory and thus an important source of data. Along with relevant external information,  the various TSPs are the bedrock of an organizational intelligence system. TSPs are creating a massive torrent of data that potentially reveals to an organization a great deal about its business and its customers.

 

Unfortunately, many organizations are unable to exploit, either effectively or efficiently,  the massive amount of data generated by TSPs for the following reasons:

 

 

In other words organizational memory is highly fragmented. A technology is needed that can accumulate a considerable proportion or organizational memory into one readily accessible system. In order to improve organizational performance, increase revenues and provide better customer service decision makers need to have facts readily available. An organization first needs to organize these data into one logical collection and then use software to sift through this collection to extract meaning.

 

A data warehouse is a subject oriented, integrated, time variant, and nonvolatile set of data that supports decision making.

 

 

Associated with data warehouse are data management aids, data extraction, analysis tools and applications.

 

A data warehouse is a snapshot of an organizatio, stored in operational data stores (ODS),  are the raw material of a data warehouse. Data must be extracted from many systems with different names and in different formats however some data may be the same from these different systems. At the time of operational system development there was no plan to ever merge the data from the systems. As a result the extraction of data from many different systems is time consuming and complex. Further extraction must be an on going process so that analysts can work with current data.

 

Transformation

Transformation is part of the data extraction process. In the warehouse, data must be standardized and follow consistent coding systems. There are several types of transformation:

 

 

Cleaning

Unfortunately,  some of the data collected from applications may be dirty – they contain errors, inconsistencies,  or redundancies. Reasons to clean data:

 

 

Cleaning can be performed using specialized software or custom written code

 

Loading

Data that have been extracted, transformed,  and cleaned can be loaded into the warehouse. There are three types of data loads:

 

  1. Archival: Historical data ( sales for the period 1980-1992) that is loaded once
  2. Current: Data from current operational systems
  3. Ongoing: Continual revision of the warehouse as operational data are generated. This is the largest challenge for warehouse management. This loading is done either by completely reloading the data warehouse or just updating it with the changes.

 

Scheduling

How often would you load new data? Since refreshing the warehouse could potentially take many hours and a data warehouse supports medium  to long term decision making, it is unlikely that it would need to be refreshed more frequently than daily. This scheduling is part of the data center’s regular operations.

 

Metadata

Basically metadata is data about data.  Metadata is housed in a data dictionary. It includes a description of each data type, its format, coding standards and the meaning of the field. For the data warehouse setting,  data dictionary is likely to include details of which operational system created the data, transformations of the data,  and the frequency of extracts. Analysts need access to metadata so they can plan their analyses and learn about the contents of the data warehouse.

 

Data Warehouse Technology

Selecting an appropriate data warehouse system is critical to support significant data mining or OLAP. Data analysis often requires intensive processing of large volumes of data, and large main memories are necessary for good performance. In addition,  the system should be scalable so that as the demand for data analysis grows,  the system can be readily upgraded. The three key building blocks of a data warehouse are the overall warehouse architecture,  the server architectures, and the DBMS.

 

Designing a data warehouse starts with selecting the physical and logical structure of the warehouse architecture. The fundamental choice is between a centralized or distributed data warehouse. A centralized data warehouse gives processing efficiency and lowers support costs.

 

A distributed architecture comes in two flavors, federated or tiered. With a federated data warehouse,  data are distributed by function. For example, financial data are on one DBMS server, marketing data on a different DBMS server at the same location,  and manufacturing data on a DBMS server at a different location. To the analyst,  the data warehouse may appear as one logical structure,  but in order to reduce response time it is physically dispersed across several related physical databases.

 

A tiered architecture houses highly aggregated data on an analyst’s workstation,  with the more detailed summaries on a second server,  and most detailed data on a third server. The first tier handles most data requests,  with the second and third tiers handling fewer requests.  The workstation at the first tier is selected to handle a heavy data analyst workload, while a third tier server is chosen to handle high data volumes but a light data analysis workload. The second tier is a data mart,  a local single subject database. In some situations, a data mart may be stand-alone rather than linked to the corporate data warehouse.

 

March 21, 2003

 

Server and DBMS Selection

 

Servers hold and deliver data to analysts. The selection of the type of server is determined by an organization’s need for scalability, availability of servers,  and ease of management of the system. There are four options:

 

  1. Single Processor – This is the simplest option, which is easy to manage but has limited processing power and scalability. Also a single server system limits reliability because when it is unavailable, so it the warehouse.
  2. Symmetric multiprocessor (SMP) – This option has multiple servers sharing memory and disks. Because processors can be added as additional processing capacity is needed, an SMP is scalable. However, as processors are added, the memory bus can become congested,  and too many CPUs will slow performance. The operating system must be designed for multiprocessing.
  3. Massively parallel processor  (MPP) – This option connects an array of processors with a high speed, high bandwidth link. These processors have independent memory and disks. Applications must be designed to work in parallel. Web research indicates that this is used for scientific computing like climate modeling.
  4. Nonuniform memory access (NUMA) – This machine joins multiple SMP nodes into a single, distributed memory pool with a single, distributed memory pool with a single operating system. NUMA has the simplicity of operation of an SMP,  and existing DBMSs and applications can be used without modification.

 

DBMS Options

 

To improve responsiveness, large data warehouses often require features that are not found in a traditional RDBMS.  In the evolution process major RDBMS vendors are likely to integrate MDDB capabilities into their products, just as they are also integrating object management.

 

The selection of server architecture and DBMS are not independent decisions. Parallelism is generally only an option for RDBMSs such as DB2 Parallel and Oracle Parallel, and not available for multidimensional and object-relational technologies. The data warehouse manager has to find the best fit between hardware and data management technologies that meet the organizations goals.

 

Exploiting data stores

 

Two of the more recently developed approaches to analyzing a data store are data mining and OLAP (On-line analytical processing)

 

Verification and discovery

The fundamental approaches to exploiting a data store

 

The verification approach to data analysis is driven by a hypothesis or conjecture about some relationship. Example: customers with incomes in the range of $50,000.00 – $75,000.00 are more likely to buy minivans. The analyst then formulates a query to process the data to test the hypothesis. The resulting report will verify or discredit the hypothesis. Depending on the result the same procedure is used over and over again until a customer group of likely prospects is discovered.

 

The verification method is highly dependent on the analyst’s prowess in finding that customer group.

 

Data mining uses the discovery approach. It sifts through the data in search of frequently appearing patterns and trends to report generalizations about the data. Data mining tools operate with minimal guidance from the operator. The advantage of discovery is that it may uncover important relationships that no amount of conjecturing would have revealed and tested.

 

 

On-line analytic processing

 

Codd, the father of the relational model, and colleagues proclaimed in 1993 that RDBMSs were never intended to provide powerful functions for data synthesis, analysis, and consolidation. This was the role of spreadsheets and special purpose applications. They argued that analysts need data analysis tools that complement the RDBMS technology and put forward the concept of OLAP, the analysis of business operations with the intention of making timely and accurate analysis-based decisions.

 

Instead of rows and columns, OLAP tools provide multidimensional views of data, as well as some other differences.  The OLAP tool enables an analyst to view how many widgets were shipped to each region by each quarter in 1997. The idea of OLAP is to give the analyst the power to view data in a variety of ways at different levels.

 

The operations supported by the typical OLAP tool include:

 

Multidimensional databases

 

OLAP is typically used with a physical multidimensional database, a data management system in which data are represented by a multidimensional structure.  The MDDB approach is to mirror and extend some of the features found in the spreadsheets by moving beyond two dimensions. These tools are built directly into the MDDB to increase the speed with which data can be retrieved an manipulated. The dimensions of analysis must be identified prior to building the database and changes can be costly and time consuming. MDDBs have size limitations where RDBMs do not.

 

The current limit of MDDB technology is approximately 10 dimensions, which can be millions to trillions of data points. At this level response is too slow. However as MDDB technology is implemented for multiprocessor server architecture, this current limit will be extended.

 

OLAP Example

 

ROLAP

An alternative to a physical MDDB is a relational OLAP (ROLAP) in which case a multidimensional model is imposed on a relational model. As we discussed earlier, this is also known as a logical MDDB. This system was designed to support OLAP and is superior.

 

The star model is used by some MDDBs to represent multidimensional data with in a relational structure. The center of the star is a table storing multidimensional facts derived from other tables. Linked to this table are the dimensions using the familiar primary key/foreign key approach of the relational model.

 

The advantage of the star model is that it makes use of a RDBMS, a mature technology capable of handling massive data stores and having extensive data management features like back up and recovery.

 

Rotation, drill-down, and drill through

 

MDDB technology supports rotation of data objects. Example: Changing the view of data year by year.

MDDB technology supports drill-down. Example: Reporting the details for each nation in a selected region.

Drill-down can slice through several layers of summary data to get to finer levels of detail.

 

The hypercube

 

There is a fundamental difference between viewing representations of data between MDDB and RDBMS. We know that RDBMS are two dimensional and thought of as tables. MDDB uses the hypercube. On paper you can only show two dimensions. This is over come by thinking of the page as the third dimension.

 

A hypercube can have many dimensions. Can you visualize six dimensions?

 

A six-dimensional hypercube can be represented by combining dimensions as shown in Figure 14-11. Brand and store are combined in the row dimension by showing the stores within a brand. The column dimension, which for each type of furniture the units sold and revenue, combines the product group and variable dimensions. The page,  third dimension, combines month and customer segment.

 

March 31, 2003

 

The Link Between RDBMS and MDDB

 

A quick inspection of Table 14-4 reveals that relational and multidimensional database technologies are designed for very different circumstances. (This was brought up in class).

Appropriate data can be periodically extracted from a RDBMS, aggregated, and loaded into a MDDB. Ideally, this process is automated so that the MDDB is continuously updated.  There must be a connection from the MDDB to the RDBMS to facilitate access to data stored in the relational system.

 

Designing a Multidimensional Database

 

The multidimensional model, based on the hypercube, requires a different design methodology from the relational model. At this stage there is no commonly used approach. Remember this is fairly new technology. Computer technology evolves quickly and good models take time to develop.  The method proposed by Thomsen is a good example at this point in the evolutionary process.

 

First of all, it must be decided what must be tracked.  A collection of tracking variables is called a variable dimension.

 

Next step is to consider what types of analyses will be performed on the variables dimension. An example would be to analyze sales by store by month. Or a comparison of this month’s sales with last month’s sales for each product and sales by class of customer. Each set of identifying factors is an identifier dimension.

 

Variables and identifiers are the key concepts of multidimensional database design. An identifier follows a regular pattern and variables do not in some cases. Identifiers and variables can be readily differentiated when there is a pattern. However there are cases when variables and identifiers are not readily set apart. See page 431 for an example of these cases.

 

The third step in the design of MDDBs is to consider the form of the dimensions. Dimensions in MDDB design are based on statistical variables. For those of you who have taken Probability and Statistics this will be a review.

 

 

A hypercube is typically a combination of several types of dimensions. For instance,  the identifier dimensions could be product and store (nominal) and the variable dimensions could be the sales and customers.  A dimensions type comes into play when analyzing relationships between identifiers and variables, which are known as independent and dependent variables in statistics. 

 

The most powerful forms of analysis are available when both dimensions are continuous. Furthermore, it is always possible to recode a continuous variable into ordinal categories.  See page 432 for an example.

 

This was a very brief description of the design of MDDBs. As the technology evolves a superior method will be developed. Just think when this class is taught in 5 years the focus might possibly be MDDB design and management.

 

Data Mining

 

Data mining is the search for relationships and global patterns that exist in large databases but are hidden in the vast amounts of data. In data mining, an analyst combines knowledge of the data with advanced machine learning technologies to discover nuggets of knowledge hidden in the data.

 

Data mining does not require the use of a data warehouse. An organization can mine data from its operational files or independent databases. However, data mining independent files will not uncover relationships that exist between data in those separate entities. It is recommended to get the most useful data to accumulate as much data as possible in a single data store, such as a data warehouse.

 

Suppose that a car buyer buys a new car every three or four years, and every time a vehicle is purchased the owner buys a more luxurious model. It is in the seller’s best interest to track the customer’s purchasing trends to anticipate the customer’s wants and needs. A common application of data mining is to develop, test, implement, measure and modify tailored marketing plans.

 

Data Mining Fuctions

 

Associations

An association function identifies affinities existing among the collection of items in a given set of records. These affinities or relationships can be expressed by rules such as, 72 percent of all the records that contain items A,B, and C also contain D and E.

 

Sequential patterns

Sequential pattern mining functions identify frequently occurring sequences from a given set of records. For example,  these functions can be used to detect the set of customers associated with certain frequent buying patterns.

 

Classifying

Classifying divides predefined classes into mutually exclusive groups such that the members of each group are as close to as possible to one another, and different groups are as far as possible from one another, where distance is measured with predefined variables. The classification of groups is done before data analysis.  Classification functions have been used extensively in applications such as credit risk analysis, portfolio selection, health risk analysis, and image and speech recognition.

 

Clustering

Clustering starts with just the data and discovers the hidden categories. These categories are derived from the data. Clustering divides the dataset into mutually exclusive groups such that the members of each group are as close as possible to one another, and different groups are as far as possible from one another, where distance is measured with respect to all available variables.

 

Prediction

Prediction calculated the future value of a variable. For example,  it might be used to predict the revenue of a new customer based on the person’s demographic variables.

 

 

The various data mining technologies can be used together. For example, a  sequence pattern analysis could identify potential customers, and then classifying could be used to distinguish between those prospects who are converted to customers and those who are not. There is a lot of competition for consumer dollars. I would expect that data mining will continually be revised and improved on.

 

Data miners use technologies that are based on statistical analysis and data visualization.

 

Decision Trees

 

Tree-shaped structures can be used to represent decisions and rules for the classification of a dataset. As well a being easy to understand, tree-based models are suited to selecting important variables and are best when many of the predictors are irrelevant.

 

Genetic Algorithms

 

Genetic algorithms are optimization techniques that use processes such as genetic combination, mutation,  and natural selection in a design based on the concepts of evolution. Possible solutions for a problem compete with each other. Genetic algorithms are suited for optimization problems with many candidate variables.

 

K-Nearest Neighbor Method

 

The nearest neighbor method is used for clustering and classification.  In the case of clustering,  the method firsts plots each record in –dimensional space, where n attributes are used in the analysis. Then, it adjusts the weights for each dimension to cluster together data points with similar goal features.  For instance, if the goal is to identify customers who frequently switch

phone companies,  the k-nearest method would adjust the weights for relevant variables to cluster switching customers in the same neighborhood. Customers who did not switch would be clustered some distance apart.

 

Neural Networks

 

A neural network, mimicking the neurophysiology of the human brain can learn from examples to find patterns in data and classify data. While neural networks can be used for classification, they must first be trained to recognize patterns in a sample dataset. Once trained, a neural network can make predictions from new data.

 

Data Visualization

 

Data visualization can make it possible for the analyst to gain a deeper, intuitive understanding of data. Because they represent data in a visual format, visualization tools take advantage of our capacity to rapidly discern patterns. Data mining can enable the analyst to focus attention on important patterns and trends and explore these in depth using visualization techniques. Data mining and data visualization work especially well together.