Beginning of Test 2 Material
March 3, 2003
Normalization is an alternative database design tool to data modeling. It was developed as part of theoretical foundation for the relational model application of a series of rules that gradually improve the design.
Data Modeling first emerged as E-R modeling in a
paper by Chen. There were two main concepts
Normalization is a method for increasing the
quality of database design. It is also a theoretical base for defining the
properties of relations. Normalization is useful cross check for ensuring the
soundness of a data model.
Relations can be classified by the types of
modification anomalies to which they are vulnerable. Meaning certain types of
tables can be prone to insertion anomalies. Some tables are prone to deletion
anomalies and others are prone to an update anomaly. Research was done in the
1970’s to classify these types of tables and to find a set of techniques that
would prevent these anomalies. The technique that was developed was
normalization.
Terminology for
Normalization:
One or more attributes determine the value of another attribute.
An identifier functionally determines all the attributes of an entity.
Basically if we are given the value of one attribute or set of attributes, we can look up the value for the other attribute.
It is denoted as
Stock code → firm name, stock price, stock quantity, stock dividend
In other words if we know the stock code we know the value of firm name etc.
An attribute, or set of attributes, that fully functionally determines another attribute is called a determinate. So in referring to stock code above we see that it is a determinant because it fully functionally determines stock PE.
Unlike a key a determinate need not be unique. In other words an attribute can determine another attribute. The example in the book shows that “course type determines fee.”
An identifier (key) can be and should be the determinate, however a determinate is not necessarily a key.
There are situations were a given value (attribute) determines multiple values.
A multi-determines B. Denoted as A →→B.
Example: Department multi-determines course.
There is a one – to – one relationship
A value of an attribute determines the value of another attribute and vice versa
Denoted as A →B and B →A.
When two attributes have a 1:1 relationship, they must occur together in at least one table in a database so that their equivalence is a recorded fact
There is a one to much relationship
A value of an attribute determines the value of another attribute but not vice versa
Denoted as A →B
Example:
As a result, if A and B occur in the same table, then A must be the key
There is a many to many relationship
Denoted as A not →B and B not →A
Example:
French and Flemish is spoken in Belgium and French is spoken in many countries
To record the m:m relationship between these attributes, a table containing both attributes as a composite key is required.
There are five levels of normalization.
The first level says A relation is in first normal
form if and only if all columns are single-valued.
In other words if you have table with single values in the columns the data model is in first normal form.
The second level says A relation is in second normal form if and only if it is in first normal form, and all non key columns are dependent on the key.
According to this definition, if a relation has a single
attribute as its key, then it is automatically in second normal form, by
default, every nonkey attribute is dependent on all of the key. Thus, second
normal form is of concern only in relations that have composite keys.
Slide 11 –
Second normal form is:
CUSTOMER-CREDIT in this case
|
ORDER |
|
|
|
ITEMNO |
CUSTOMERID |
QUANTITY |
CUSTOMER-CREDIT |
|
12 |
57 |
25 |
OK |
|
34 |
679 |
3 |
POOR |
So to fix this problem we would break the ORDER table back to the original table structure of ITEM and CUSTOMER
The third level says A relation is in third normal
form if and only if it is in second normal form and has no transitive
dependencies.
Transitive dependencies….
March 5, 2003
Transitive dependencies finally addressed. A transitive dependency occurs when part of the attributes are determined by the key and then those attributes which are determined by the key are the determinates for another attribute
Example
SID |
Building |
Fee |
|
100 |
Jesse Hall |
3200 |
|
150 |
Graig Hall |
3100 |
|
200 |
Jesse Hall |
3200 |
|
250 |
Jesse Hall |
3200 |
The functional dependency is that Building determines Fee and SID determines building.
This is remedied by splitting the table into two tables.
SID |
Building |
|
100 |
Jesse Hall |
|
150 |
Graig Hall |
|
200 |
Jesse Hall |
|
250 |
Miller Hall |
|
300 |
Jesse Hall |
|
Building |
Fee |
|
Jesse Hall |
3200 |
|
Graig Hall |
3100 |
|
Jesse Hall |
3200 |
|
Miller Hall |
3100 |
|
Jesse Hall |
3200 |
What happens when you have:
You cry and tear your hair …then you deal with it by inventing a stronger version of 3NF (third normal form).
The precise definition is a relation is in Boyce-Codd
normal form if and only if every determinant is a candidate key.
What is a candidate key?
By looking at the table below we see that by combining SID and Major we can come up with Fname. Remember when we use two attributes in combination this is called a composite key. The combination SID and Fname will get us Major. The combinations (SID , Major) and (SID, Fname) are candidate keys. The combination we pick to use will be the primary key.
Example:
This table is in third normal form however it fails the Boyce-Codd test.
|
SID |
Major |
Fname |
|
100 |
Math |
Cauchy |
|
150 |
Psychology |
Jung |
|
200 |
Math |
Riemann |
|
250 |
Math |
Cauchy |
|
300 |
Psychology |
Perls |
|
300 |
Math |
Reimann |
Primary key(s) SID, Major
Key (candidate) SID, Fname
Functional Dependencies Fname determines Major
Solution:
|
SID |
Fname |
|
100 |
Cauchy |
|
150 |
Jung |
|
200 |
Riemann |
|
250 |
Cauchy |
|
300 |
Perls |
|
300 |
Reimann |
|
Fname |
Subject |
|
Cauchy |
Math |
|
Jung |
Psychology |
|
Riemann |
Math |
|
Cauchy |
Math |
|
Perls |
Psychology |
|
Reimann |
Math |
Normalization continued
The precise definition is a relation is in fourth
normal form if it is in BCNF and has no multi-value dependencies.
|
Student |
|
|
|
|
Student ID |
Sport |
Subject |
… |
|
50 |
Football |
English |
… |
|
50 |
Football |
Music |
… |
|
50 |
Tennis |
Botany |
… |
|
50 |
Karate |
Botany |
… |
What is a multi-value dependency?
Refer to the above table. Note that Student ID can participate in many sports so Sport does not functionally depend on Student ID. In addition Student ID can participate in many subjects. In general a multi-value dependency exists when a relation has at least three attributes, two of them are multi-value, and their values depend on the third attribute.
In other words, in a relation R(A,B,C), a multi-value dependency exists if A determines multiple values of B, A determines multiple values of C and B and C are independent of each other.
Multi value dependencies lead to modification anomalies. The solution is to break the one table into two tables.
<![if !supportEmptons, as we have been doing, but then cannot be reconstructed. The condition under which this situation arises has no clear, intuitive meaning. We do not know what the consequences of such dependencies are or even if they have practical consequences…”
The precise definition is a relation is in domain
key/normal form if and only if every constraint on the relation is a logical
consequence of the domain constraints and the key constraints that apply to the
relation.
What are example of constraints?
What is a domain?
Unfortunately, there is no known algorithm for converting a relation to DK/NF, nor is it even known which relations can be converted to DK/NF. Finding, or designing, DK/NF relations is more of an art than a science.
In spite of this, in the practical work of database design, DK/NF is an exceedingly useful design objective. If we can define relations in a way that constraints on them are logical consequences of domains and keys, then there will be no modification anomalies.
Example 1 of Domain/Key Normal Form
STUDENT(SID, GradeLevel, Building, Fee)
Key: SID
Constraints: Building → Fee
SID must not begin with digit 1
We know:
The question is how do we create relations from the above information so that there will be no modification anomalies?
We give the domains for the attributes definitions:
SID in CDDD, where C is a decimal digit not = 1; D = decimal digit
GradeLevel in {FR, SO, JR, SN, GR}
Building in CHAR(4)
Fee in DEC(4)
STUDENT (SID, GradeLevel, Building)
Key: SID
BLDG-FEE (Building, Fee)
Key: Building
The E-R (Entity –Relationship) model
One of the most widely known data modeling methods is the entity-relationship model, developed by Chen. There is no standard for the E-R model, and it has been extended and modified in a number of ways.
Typically the shapes mean the following:
Rectangles represent entities
Diamonds represent relationships with the cardinality encapsulated inside the diamond
1:1 (one to one), 1:N (one to many) and M:N (many to many). One important difference is that a m:m is not shown as an associative entity; thus the database designer must convert this relationship to a table.
Ellipses represent attributes
The ways to represent relationships is varied (see page 222 of text) Some modeling techniques are very refined in their approach and other data modeling mediums have loose definition of relationships. Which ever model is chosen the same goal remains…improving database design. The data model chosen for use in this book was chosen for its simplicity and high quality results.
The Relational Model
So far this semester we have covered data modeling, normalization and E-R diagram modeling of database design. This chapter officially introduces the relational model. We have touched on components of the relational model but now we will bring the components together.
The relational model was developed as a result of recognized short comings of hierarchical and network DBMSs. According to Codd there are three good reasons for adopting the relational mode over other database models
Objectives of relational model research
The relational model has three components:
Domain – The set of values all of the same data type
Relations – a table of n columns and m rows. Each column has a unique name, and all the values in a column are drawn from the same domain. Each row is uniquely identified.
The cardinality of a relation is its number of rows. The degree of a relation is the number of columns. Hence one can expect that the cardinality of a relation will change but the degree shall not. The degree changes if a column is added to a relation, but in terms of relational theory, it is considered to be a new relation.
A relational database collection of relations or tables. Tables are linked by a common tuple that need not have the same name in both tables but it must draw on the same domain.
The primary key is the unique identifier
In some situations there may be several attributes – known
as candidate keys – that are potential primary keys. One of the
candidate keys is chosen as the primary and the candidate keys remaining are
then referred to as alternate keys.
A foreign key is an attribute (possibly composite) of a relation that is also a primary key of a relation. The foreign key and primary key may be in different relations or the same relation, but both keys should be drawn from the same domain.
No component of the primary key of a relation may be
null.
This is the entity integrity rule. This rule ensures that each instance of an entity described in a relation is identifiable in some way. Its implementation means that each row in a relation can be uniquely identified.
In short this enforces the definition of a foreign key. The referential integrity rule says that a foreign key must have a corresponding primary key.
We have a pretty good idea what a relational database consists of. We have been working with them all semester. According to our text there are four ways to manipulate relational tables.
We are fairly familiar with SQL and QBE now we will take a closer look at relational algebra and relational calculus.
While relational algebra can be used to resolve queries, it is seldom employed because it requires you to specify what you want and how to get it. This makes relational algebra more difficult to use than SQL.
Relational calculus is a non-procedural language, there for it is less concerned with how you get information. While relational calculus over comes some of the shortcomings of relational algebra it is harder to learn. As a result language designers developed SQL and QBE.
Relational algebra is the standard for judging data retrieval languages. If a language, such as SQL, can be used to express every relational algebra operator, it is said to be relationally complete,
Restrict - Creates a new table from specified rows of an existing table
Project- Creates a new table from specified columns of an existing table
Product – Creates a new table from all of the existing rows of two existing rows. If Table A has 2 columns and 3 rows; Table B has 3 columns and two rows. The product of A and B will have 5 columns and six rows.
Union – Creates a new table containing rows appearing in one or both tables of two existing tables. The relations must be union compatible, which means that they have the same column names, in the same order, and drawn on the same domain. Union is the sum of the rows in both tables. If it is in Table A or Table B it will be in Table C less the duplications.
Intersect – Creates a new table containing rows
appearing in both tables of two existing tables. If it is in Table A and it is in Table B it will be in
the resultant table. The relations must be union compatible.
Difference – Creates a new table containing rows appearing in one table but not in the other of two existing tables. The difference of two relations is a new relation containing all rows appearing in the first relation but not the second.
Join – creates a new relation from two relations for
all combinations of rows satisfying the join condition. Technically a join is a
product of two tables followed by a restrict.
Divide – Division requires that A and B have a set of attributes that are common to both relations.
The full set of relational operators is not required. Join, Divide and Intersect can be described and done by using combinations of the other operators. Hence there are five primitive relational algebra operators.
Comparison of relational algebra primitive operations and SQL:.
Operation |
Relational algebra |
SQL |
|
Restrict |
A where condition |
SELECT * FROM A WHERE condition |
|
Project |
A[X] |
SELECT X FROM A |
|
Product |
A times B |
SELECT * FROM A, B |
|
Union |
A union B |
SELECT * FROM A UNION SELECT * FROM B |
|
Difference |
A minus B |
SELECT * FROM A WHERE NOT EXISTS (SELECT * FROM B WHERE A.X = B.X AND A.Y = B.Y AND … |
A relational database has structures, integrity rules and a manipulation language. A fully relational database system provides complete support for each of these components. Many commercial systems support SQL, but do not provide support for domains or integrity rules. Such systems are not fully relational but are relationally complete
The proponents of OO maintain that ODBMS offers several advantages over current implementation of relational database technology. They assert that the shortcomings of the relational model are
It should be noted that there is considerable disagreement between representatives of OO and relational adherents. This chapter tries to present a balance of the view points.
Unified Modeling Language was introduced in 1997 by the Object Management Group. UML has become the common modeling language for OO software development. UML standardizes the notation for describing processes and has nine predefined diagrams.
We will focus on the class diagram because of its correspondence to the data model.
E-business developers are using UML to meet customer demands.
Traditional relational database structure in not designed to handle complex data structures such as
Object Oriented Programming Languages (OOPL) have been in the making for about 40 years. Smalltalk was developed by Xerox PARC in the 1970’s and introduced the terms object and OO. As you know OO programming has become the standard in many development arenas these days.
Key OO concepts
OO applications are created by assembling and using objects – self contained units that can contain data – very much like entities in data modeling. An object is something in the real world. Objects also contain the necessary instructions, the algorithm to transform data. Objects are application development building blocks.
OO modeling and data modeling have a common objective of creating a representation of a real-world information system. Consequently, they frequently deal with the same concepts but use different terminology.
|
Data Modeling Term |
OO Modeling Term |
|
Entity |
Object class or classes |
|
Instance |
Object |
|
Attribute |
Attribute |
Data Abstraction is the process of creating an abstract model to enhance understanding of reality. An organizational chart shows the associations between roles in organization. A diagram of an investment portfolio is another example of data abstraction.
Encapsulation means that all processing that changes the state of an object (changes values) is done within that object. Encapsulation implies that an object is shielded from interference from other objects; that is, and object cannot directly change any other objects. Encapsulation means that data and methods can be packaged together.
Objects communicate with each other by sending and receiving messages. The receiving object needs a method for handling the message. A message can trigger a change in a receiving object.
An important feature
of OO is that classes in a generalization/specialization hierarchy will inherit
data and methods from the superclass.
Don’t reinvent the
wheel. If a class with similar characteristics is in existence …reuse it. But
be careful. Watch out for kangaroos.
Why use 00?
The slow growth in
programmer productivity is exacerbated by the massive effort required to keep
operational software systems current. Some companies spend as much as 80
percent of their potential development resources on maintenance of old systems.
OO software in
constructed by creating self contained object classes. Because each object is
relative independent there is less ripple effect, which often happens with
traditional programming methods. The code is easier to understand even if
documentation is sparse and the maintenance programmer has little knowledge of
the application.
Gathering client
information is difficult at best. OO advocates say that the OO modeling
techniques is more in tune to the way clients view their world.
Adding features is
deemed to be easier with OO methodology due to the independence of the existing
structure. Thus OO is suitable for evolutionary development in which the main
features of the system are first developed, and less critical elements added
later.
A model by
definition is an abstraction of something else. When modeling, we omit some of
the features or detail we find in the real world and incorporate only those we
find useful to the client’s needs.
A city map is a
model of some aspects of a city’s streets. A model should reduce the
communication gap between the client and the analyst. OO supporters assert that dealing with systems objects, which
closely resemble real-life objects is more natural than dealing with entities,
relationships and data flows.
The procedural model describes what the system needs to do and how
to do it. Often referred to as the dynamic part of the of an information systems
model, a procedural model has two parts.
The static aspects
are modeled in the class diagram, which shows the potential or the capabilities
of the system as opposed to what the system will actually do in a given
circumstance.
A simple banking
system is good example.
The basic modeling
concepts learned earlier apply to the OO modeling of the static part of the
system. Like entities, object classes can be related to other object classes.
Two broad types of
possible relationships are depicted in an OO model:
Finding objects and
classes is very similar to identifying entities. Which classes to include in a
system is highly dependent on the context, the client, and the modeler.
Although there is no correct set of classes the OO analyst should strive to
create a high fidelity model of the client’s world.
One approach is to
underline any nouns in the problem description. Most nouns are possible
classes, and underlining ensures that no potential classes are overlooked.
We will use a
four-layer model of OO analysis
Although presented
sequentially, actual analysis will usually not strictly follow such a pattern.
Customers can rent
diving equipment and boats from DDT. When a customer has seen what is available
and decided what to rent, a rental agreement or contract is produced and signed.
Underline the nouns
and we see that Customer, Diving equipment, Contract and Boat are prime
candidates for classes
There are two types
of structures:
Generalization
Depending on how and
with which classes we start, OO modeling can be top down or bottom up. If we
look at BOAT and DIVING EQUIPMENT first and realize that they can both be part
of a generalized class which we call RENTAL EQUIPMENT we have done bottom up
modeling. If we started with RENTAL EQUIPMENT we have done top-down modeling.
RENTAL EQUIPMENT
does not contain any objects, which is why it is an abstract class. It serves
solely as a vehicle for specifying the common attributes and methods of DIVING
EQUIPMENT and BOAT.
As we look at this
situation a bit closer we realize that more detail may be required. DIVING
EQUIPMENT would need specialized information about some of this equipment. For
example suits are described by size, thickness and type. SUIT would be a
subclass to DIVING EQUIPMENT.
An inheritance
relation is between classes, but inheritance is not like a relationship in data
modeling. The inheritance relationship between SUIT and DIVING EQUIPMENT does
not imply that a SUIT object (one specific suit) is linked to a corresponding
DIVING EQUIPMENT object. Here,
inheritance means that SUIT has the same methods and attributes and DIVING
EQUIPMENT and some other methods and attributes that are unique to SUIT.
So far,
generalization structures have been simple tree structures. Each subclass has
one superclass from which it inherits methods and attributes. Multiple
inheritance means that a class can inherit attributes and methods from more
than one superclass. The generalization structure becomes more like a lattice
than a tree.
As you may
anticipate, multiple inheritance can create problems when an application is
executed. There may be conflicts between the inherited methods and attributes.
Although there are many strategies for dealing with this problem the issue
remains unresolved. One strategy is not to support multiple inheritance, which
is the approach taken by some OOPL’s. Java does not support multiple
inheritance.
An aggregation shows the whole assembly and its parts. See page 396 of your text for an example. The relationship is sometimes call a whole/part structure because a TRAILER is a part of BOAT ASSEMBLY. It is also sometimes called an owned relationship. Note the cardinality.
The composition relationshipis distinguished from the other types of relationships by open or filled diamonds. In this case, we have a composite aggregation between boat assembly and the boat’s hull; they are inseparable. The other two aggregations are shared because a boat assembly might not always have a trailer or motor.
An OO model, like a data model records details of the aggregation relationship. In the case of an OO model the relationship is between objects.
Read the aggregation model as follows: (figure 13-10)
We will combine pieces to illustrate the current state of our OO model. Note this model is incomplete because it does not include CUSTOMER or CONTRACT.
We have covered the Class and object layer, and the structure layer. Now we will discuss the attribute layer.
The attribute layer has two purposes: specification of attributes and specification of association relationships. Specification is similar to data modeling. Some OO proponents contend that you need not define attributes until physical design or implementation. The author recommends that you specify attributes as you discover them. How much detail you include seems to depend on what you are modeling.
Sometimes called reference relationships, association relationships depict links between occurrences of an object. Also called instance connections, they are conceptually the same as relationships in data modeling.
The aggregation relationship is actually a special version of the association relationship. By using both aggregation and association modeling we come up with a model that is closer to depicting reality.
The required processing for each object is described in the service layer. Simple and complex services are distinguished. Simple services are the standard services that most objects need to be able to perform. These are usually implied as always being available and are not shown in the model. Simple services typically include create, connect, access and release.
Complex or nonstandard services, which are shown explicitly, fall into two categories: calculate and monitor. For example, the object class CONTRACT needs to calculate a rental fee. This nonstandard service is shown in the services window.
Mapping OO model to a relational database…here are some alternatives:
Persistent Objects
In OO programming,
objects only exist for the duration of a program’s execution. If we use OO for
information systems, we need to be able to store details of persistent objects,
those that last much longer. We need an ODBMS which stores simple and complex
objects. It also supports the creation of abstract data types, encapsulation
and inheritance.
Below is a useful URL which has journals associated with what we will cover this chapter.
http://lamp.infosys.deakin.edu.au/journals/index.php?page=alpha&alpha=D
Organizational
intelligence is the outcome of an
organization’s effort to collect, store, process, and interpret data from
internal and external sources. The conclusions or clues gleaned from an
organization’s data stores enable it to identify problems or opportunities,
which is the first stage of decision making.
Transaction
processing systems (TSPs) are a
core component of organizational memory and thus an important source of data.
Along with relevant external information,
the various TSPs are the bedrock of an organizational intelligence
system. TSPs are creating a massive torrent of data that potentially reveals to
an organization a great deal about its business and its customers.
Unfortunately, many
organizations are unable to exploit, either effectively or efficiently, the massive amount of data generated by TSPs
for the following reasons:
In other words
organizational memory is highly fragmented. A technology is needed that can
accumulate a considerable proportion or organizational memory into one readily
accessible system. In order to improve organizational performance, increase
revenues and provide better customer service decision makers need to have facts
readily available. An organization first needs to organize these data into one
logical collection and then use software to sift through this collection to
extract meaning.
A data warehouse is a subject oriented, integrated, time variant, and nonvolatile set of
data that supports decision making.
Associated with data
warehouse are data management aids, data extraction, analysis tools and
applications.
A data warehouse is
a snapshot of an organizatio, stored in operational data stores (ODS), are the raw material of a data warehouse.
Data must be extracted from many systems with different names and in different
formats however some data may be the same from these different systems. At the
time of operational system development there was no plan to ever merge the data
from the systems. As a result the extraction of data from many different
systems is time consuming and complex. Further extraction must be an on going
process so that analysts can work with current data.
Transformation is
part of the data extraction process. In the warehouse, data must be standardized
and follow consistent coding systems. There are several types of
transformation:
Unfortunately, some of the data collected from applications
may be dirty – they contain errors, inconsistencies, or redundancies. Reasons to clean data:
Cleaning can be
performed using specialized software or custom written code
Data that have been
extracted, transformed, and cleaned can
be loaded into the warehouse. There are three types of data loads:
Scheduling
How often would you
load new data? Since refreshing the warehouse could potentially take many hours
and a data warehouse supports medium to
long term decision making, it is unlikely that it would need to be refreshed
more frequently than daily. This scheduling is part of the data center’s
regular operations.
Basically metadata
is data about data. Metadata is housed
in a data dictionary. It includes a description of each data type, its format,
coding standards and the meaning of the field. For the data warehouse
setting, data dictionary is likely to
include details of which operational system created the data, transformations
of the data, and the frequency of
extracts. Analysts need access to metadata so they can plan their analyses and
learn about the contents of the data warehouse.
Data Warehouse
Technology
Selecting an
appropriate data warehouse system is critical to support significant data
mining or OLAP. Data analysis often requires intensive processing of large
volumes of data, and large main memories are necessary for good performance. In
addition, the system should be scalable
so that as the demand for data analysis grows,
the system can be readily upgraded. The three key building blocks of a
data warehouse are the overall warehouse architecture, the server architectures, and the DBMS.
Designing a data
warehouse starts with selecting the physical and logical structure of the
warehouse architecture. The fundamental choice is between a centralized or
distributed data warehouse. A centralized
data warehouse gives processing
efficiency and lowers support costs.
A distributed architecture comes in two flavors, federated or tiered.
With a federated data warehouse,
data are distributed by function. For example, financial data are on one
DBMS server, marketing data on a different DBMS server at the same
location, and manufacturing data on a
DBMS server at a different location. To the analyst, the data warehouse may appear as one logical structure, but in order to reduce response time it is
physically dispersed across several related physical databases.
A tiered architecture houses highly aggregated data on an analyst’s
workstation, with the more detailed
summaries on a second server, and most
detailed data on a third server. The first tier handles most data requests, with the second and third tiers handling
fewer requests. The workstation at the
first tier is selected to handle a heavy data analyst workload, while a third
tier server is chosen to handle high data volumes but a light data analysis
workload. The second tier is a data mart,
a local single subject database. In some situations, a data mart may be
stand-alone rather than linked to the corporate data warehouse.
Server and DBMS
Selection
Servers hold and
deliver data to analysts. The selection of the type of server is determined by
an organization’s need for scalability, availability of servers, and ease of management of the system. There
are four options:
To improve
responsiveness, large data warehouses often require features that are not found
in a traditional RDBMS. In the
evolution process major RDBMS vendors are likely to integrate MDDB capabilities
into their products, just as they are also integrating object management.
The selection of
server architecture and DBMS are not independent decisions. Parallelism is
generally only an option for RDBMSs such as DB2 Parallel and Oracle Parallel,
and not available for multidimensional and object-relational technologies. The
data warehouse manager has to find the best fit between hardware and data
management technologies that meet the organizations goals.
Two of the more
recently developed approaches to analyzing a data store are data mining and
OLAP (On-line analytical processing)
The fundamental approaches to exploiting a data store
The verification approach to data analysis is driven by a hypothesis or conjecture about some relationship. Example: customers with incomes in the range of $50,000.00 – $75,000.00 are more likely to buy minivans. The analyst then formulates a query to process the data to test the hypothesis. The resulting report will verify or discredit the hypothesis. Depending on the result the same procedure is used over and over again until a customer group of likely prospects is discovered.
The verification method is highly dependent on the analyst’s prowess in finding that customer group.
Data mining uses the discovery approach. It sifts through the data in search of frequently appearing patterns and trends to report generalizations about the data. Data mining tools operate with minimal guidance from the operator. The advantage of discovery is that it may uncover important relationships that no amount of conjecturing would have revealed and tested.
On-line analytic
processing
Codd, the father of
the relational model, and colleagues proclaimed in 1993 that RDBMSs were never
intended to provide powerful functions for data synthesis, analysis, and
consolidation. This was the role of spreadsheets and special purpose
applications. They argued that analysts need data analysis tools that
complement the RDBMS technology and put forward the concept of OLAP, the analysis of business operations with the intention of making timely
and accurate analysis-based decisions.
Instead of rows and
columns, OLAP tools provide multidimensional views of data, as well as some
other differences. The OLAP tool
enables an analyst to view how many widgets were shipped to each region by each
quarter in 1997. The idea of OLAP is to give the analyst the power to view data
in a variety of ways at different levels.
The operations
supported by the typical OLAP tool include:
OLAP is typically used with a physical multidimensional database, a data management system in which data are
represented by a multidimensional structure.
The MDDB approach is to mirror and extend some of the features found in
the spreadsheets by moving beyond two dimensions. These tools are built
directly into the MDDB to increase the speed with which data can be retrieved
an manipulated. The dimensions of analysis must be identified prior to building
the database and changes can be costly and time consuming. MDDBs have size
limitations where RDBMs do not.
The current limit of
MDDB technology is approximately 10 dimensions, which can be millions to
trillions of data points. At this level response is too slow. However as MDDB
technology is implemented for multiprocessor server architecture, this current
limit will be extended.
An alternative to a physical MDDB is a relational OLAP (ROLAP) in which case a multidimensional model is imposed on a relational model. As we discussed earlier, this is also known as a logical MDDB. This system was designed to support OLAP and is superior.
The star model is used by some MDDBs to represent multidimensional data with in a relational structure. The center of the star is a table storing multidimensional facts derived from other tables. Linked to this table are the dimensions using the familiar primary key/foreign key approach of the relational model.
The advantage of the star model is that it makes use of a RDBMS, a mature technology capable of handling massive data stores and having extensive data management features like back up and recovery.
MDDB technology supports rotation of data objects. Example: Changing the view of data year by year.
MDDB technology supports drill-down. Example: Reporting the details for each nation in a selected region.
Drill-down can slice through several layers of summary data to get to finer levels of detail.
There is a fundamental difference between viewing representations of data between MDDB and RDBMS. We know that RDBMS are two dimensional and thought of as tables. MDDB uses the hypercube. On paper you can only show two dimensions. This is over come by thinking of the page as the third dimension.
A hypercube can have many dimensions. Can you visualize six dimensions?
A six-dimensional hypercube can be represented by combining dimensions as shown in Figure 14-11. Brand and store are combined in the row dimension by showing the stores within a brand. The column dimension, which for each type of furniture the units sold and revenue, combines the product group and variable dimensions. The page, third dimension, combines month and customer segment.
The Link Between RDBMS and MDDB
A quick inspection of Table 14-4 reveals that relational and multidimensional database technologies are designed for very different circumstances. (This was brought up in class).
Appropriate data can be periodically extracted from a RDBMS, aggregated, and loaded into a MDDB. Ideally, this process is automated so that the MDDB is continuously updated. There must be a connection from the MDDB to the RDBMS to facilitate access to data stored in the relational system.
The multidimensional model, based on the hypercube, requires a different design methodology from the relational model. At this stage there is no commonly used approach. Remember this is fairly new technology. Computer technology evolves quickly and good models take time to develop. The method proposed by Thomsen is a good example at this point in the evolutionary process.
First of all, it must be decided what must be tracked. A collection of tracking variables is called a variable dimension.
Next step is to consider what types of analyses will be
performed on the variables dimension. An example would be to analyze sales by
store by month. Or a comparison of this month’s sales with last month’s sales
for each product and sales by class of customer. Each set of identifying
factors is an identifier dimension.
Variables and identifiers are the key concepts of multidimensional database design. An identifier follows a regular pattern and variables do not in some cases. Identifiers and variables can be readily differentiated when there is a pattern. However there are cases when variables and identifiers are not readily set apart. See page 431 for an example of these cases.
The third step in the design of MDDBs is to consider the form of the dimensions. Dimensions in MDDB design are based on statistical variables. For those of you who have taken Probability and Statistics this will be a review.
A hypercube is typically a combination of several types of dimensions. For instance, the identifier dimensions could be product and store (nominal) and the variable dimensions could be the sales and customers. A dimensions type comes into play when analyzing relationships between identifiers and variables, which are known as independent and dependent variables in statistics.
The most powerful forms of analysis are available when both dimensions are continuous. Furthermore, it is always possible to recode a continuous variable into ordinal categories. See page 432 for an example.
This was a very brief description of the design of MDDBs. As the technology evolves a superior method will be developed. Just think when this class is taught in 5 years the focus might possibly be MDDB design and management.
Data mining is the search for relationships and global patterns that exist in large databases but are hidden in the vast amounts of data. In data mining, an analyst combines knowledge of the data with advanced machine learning technologies to discover nuggets of knowledge hidden in the data.
Data mining does not require the use of a data warehouse. An organization can mine data from its operational files or independent databases. However, data mining independent files will not uncover relationships that exist between data in those separate entities. It is recommended to get the most useful data to accumulate as much data as possible in a single data store, such as a data warehouse.
Suppose that a car buyer buys a new car every three or four years, and every time a vehicle is purchased the owner buys a more luxurious model. It is in the seller’s best interest to track the customer’s purchasing trends to anticipate the customer’s wants and needs. A common application of data mining is to develop, test, implement, measure and modify tailored marketing plans.
Associations
An association function identifies affinities existing among the collection of items in a given set of records. These affinities or relationships can be expressed by rules such as, 72 percent of all the records that contain items A,B, and C also contain D and E.
Sequential pattern mining functions identify frequently occurring sequences from a given set of records. For example, these functions can be used to detect the set of customers associated with certain frequent buying patterns.
Classifying divides predefined classes into mutually exclusive groups such that the members of each group are as close to as possible to one another, and different groups are as far as possible from one another, where distance is measured with predefined variables. The classification of groups is done before data analysis. Classification functions have been used extensively in applications such as credit risk analysis, portfolio selection, health risk analysis, and image and speech recognition.
Clustering
Clustering starts with just the data and discovers the hidden categories. These categories are derived from the data. Clustering divides the dataset into mutually exclusive groups such that the members of each group are as close as possible to one another, and different groups are as far as possible from one another, where distance is measured with respect to all available variables.
Prediction calculated the future value of a variable. For example, it might be used to predict the revenue of a new customer based on the person’s demographic variables.
The various data mining technologies can be used together. For example, a sequence pattern analysis could identify potential customers, and then classifying could be used to distinguish between those prospects who are converted to customers and those who are not. There is a lot of competition for consumer dollars. I would expect that data mining will continually be revised and improved on.
Data miners use technologies that are based on statistical analysis and data visualization.
Decision Trees
Tree-shaped structures can be used to represent decisions and rules for the classification of a dataset. As well a being easy to understand, tree-based models are suited to selecting important variables and are best when many of the predictors are irrelevant.
Genetic Algorithms
Genetic algorithms are optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution. Possible solutions for a problem compete with each other. Genetic algorithms are suited for optimization problems with many candidate variables.
K-Nearest Neighbor Method
The nearest neighbor method is used for clustering and classification. In the case of clustering, the method firsts plots each record in –dimensional space, where n attributes are used in the analysis. Then, it adjusts the weights for each dimension to cluster together data points with similar goal features. For instance, if the goal is to identify customers who frequently switch
phone companies, the k-nearest method would adjust the weights for relevant variables to cluster switching customers in the same neighborhood. Customers who did not switch would be clustered some distance apart.
Neural Networks
A neural network, mimicking the neurophysiology of the human brain can learn from examples to find patterns in data and classify data. While neural networks can be used for classification, they must first be trained to recognize patterns in a sample dataset. Once trained, a neural network can make predictions from new data.
Data Visualization
Data visualization can make it possible for the analyst to gain a deeper, intuitive understanding of data. Because they represent data in a visual format, visualization tools take advantage of our capacity to rapidly discern patterns. Data mining can enable the analyst to focus attention on important patterns and trends and explore these in depth using visualization techniques. Data mining and data visualization work especially well together.