Functional
Dependencies
and
Normalization
Chapter 10
Designing a database
•
So far, we have designed
our base tables from a E-R diagram or by common sense
•
We still need some
formal measure of why one group of attributes is a better base table than
another group.
•
This chapter discusses
some of the theory of what a “good” design is
Gathering the data
• Identify the data requirements of the users.
–
This sounds simple, and
it CAN be, but usually isn't because users rarely can clearly describe what
they need.
–
In addition, the data
processing staff, who are quite knowledgeable how the
system functions, often have tunnel vision, which inhibits creative thought.
Normalization helps make this process
objective
• In many respects, normalization is simply applied
common sense.
• It is a technique for reviewing the entity/attribute
list to insure that attributes are stored "where they belong"
•
We did not have
any formal measure of why one grouping of attributes into a relation may be
better than another.
• We will be talking about the logical level or
second level of the architecture. i.e
designing base tables.
Informal discussion of
some of the criteria for good and bad relational schema.
• 1. The meaning (semantics) of attributes
• 2. Redundant Information and Update Anomalies
• 3. Avoiding NULL values
• 4. Avoiding generation of spurious tuples
1. The meaning (semantics) of attributes
•
When we group
attributes to form a relation, we assume that a certain meaning is associated
with the attributes.
•
This meaning
helps us understand how the attributes in a tuple are related to each other.
• In general, the easier it is to explain the semantics
of the relation, the better the relation schema design.
• GUIDELINE 1: Design a relation
schema so that it has attributes of only one entity.
2. Redundant Information and Update
Anomalies
(or why some
organizations do not work as well as others)
•
What is the semantics of
these relations?
EMP_DEPT
EMP_PROJ
Why are these poor designs?
•
These relations also
have clear meanings.
•
There is nothing
logically wrong with these two relations,
–
but they mix attributes from distinct real-world
entities.
• They may be used as views, but they cause several
problems when used as base tables.
What are the problems?
•
They take up more
storage space because of data redundancy.
•
Another serious problem
with using relations like this as base tables is the problem of update
anomalies.
– anomalies are the result of a poorly designed database.
There are three types of Update Anomalies
•
1. Insertion anomalies
–
Must be careful that
DNAME & DMGRSSN agree with DNUM
–
Difficult to insert a
new department that has no employees yet
• 2. Deletion anomalies
–
If we delete the last
employee of a dept, we eliminate the information on that department
• 3. Modification anomalies
–
If dept 4 gets a new
manager, then
all the employees who work in dept 4's tuples must be updated.
•
GUIDELINE 2: Design the base tables so no insertion,
deletion, or modification anomalies occur in the relations.
The second guideline is consistent
with and in a way a restatement of the first.
•
We begin to see a
need for a more formal approach to help us evaluate whether our design meets
these guidelines.
3. Avoiding NULL values
•
In some base
tables we may group many attributes together into a "fat" relation
•
If many of the
attributes do not apply to all tuples in the relation, we end up with many NULL's in the relation.
•
This can be
wasteful of storage space, and may also lead to problems with understanding of
the meaning of the attributes.
•
NULL's can have multiple interpretations
–
The attribute
does not apply to this tuple
–
The attribute
value for this tuple is unknown
– The value is known but absent (has not been recorded
yet)
• GUIDELINE 3: Avoid placing
attributes in a base table whose values may be NULL, as far a possible.
•
For example, if only 10%
of employees have individual offices, do not put an attribute OFF_NUM in the
EMPLOYEE relation.
•
Solution: Create a new
relation EMP_OFF that includes only tuples for employees who have offices.
4. Avoid generation of spurious tuples
• Look at the relations EMP_LOC and EMP_PROJ1
•
A tuple in EMP_LOC
means that an employee works on some project at plocation
•
A tuple in EMP_PROJ1
means that an employee with SSN works on project for hours/week, the name and
location of the project are also kept
Using EMP_LOC and EMP_PROJ1 as base tables
•
This is a very bad
schema design
•
If we do a natural join
of these tables having as join conditions:
–
EMP_LOC.plocation = EMP_PROJ1.plocation
•
We get many spurious
tuples
•
A spurious tuple
has wrong information; information that is not valid
•
Why is this
happening?
–
PLOCATION is not
a PK in either table, so it cannot be a FK
Functional Dependency
•
This is a more
formal, more objective design criteria.
•
Functional
dependency is the single most important concept in relational database design.
•
A FD is a
property of the meaning or semantics of the attributes of a relation.
–
Before I give a
formal definition, lets look at dependency informally first.
–
Figure
out which attributes are dependent on other attributes.
Formal Definition of Functional
Dependency
•
Given a relation R and
attributes X and Y,
•
we say Y is functionally dependent on X if and only if
–
for each X value there is precisely one Y value
associated with it.
–
i.e. X functionally determines Y
•
this is written like this: X -> Y
•
What are the functional
Dependencies of the EMP_DEPT and the EMP_PROJ relations
The FD of EMP_DEPT and the EMP_PROJ
relations
SSN -> {ename, bdate,
address, dnumber}
dnumber -> {dname,
dmgrssn}
{SSN, pnumber} ->{hours}
SSN -> {ename}
pnumber -> {pname, plocation}
Full functional dependency
•
If the X is composite,
then there is a possibility that Y is not fully functionally dependent on X.
•
To be fully functionally
dependent, Y must not be dependent on any proper subset of X.
•
{SSN,PNUMBER} -> ENAME
is not fully functionally dependent because ENAME is dependent on SSN, a subset
of {SSN, PNUMBER}.
The set of FD on a relation schema
•
Let the letter F
denotes the set of FDs on a relation
•
Example for the
EMP_DEPT relation
F
= {SSN -> {ename,
bdate, address, dnumber
dnumber -> {dname, dmgrssn}}
•
These are the FD
that are semantically obvious
•
Are they the only
FD?
–
There are many
other FD that can be inferred or deduced from the semantic obvious ones
•
Example
–
SSN-> {dname, dmgrssn} or even
SSN -> SSN, or dnumber
-> dname
Inferred FDs
• Suppose F is the set of functional dependencies
that are specified on relational schema R
–
Typically, the schema
designer specifies the FD that are semantically obvious
–
But there are other depenencies that can be inferred or deduced from the FDs in F
•
The closure F+ of
F is the set of all functional dependencies that can be inferred from F
Inference rules of functional
dependencies
•
The inference rules are
a set of rules that can be used to infer all the possible FDs
from F
•
Some of these FD are
trivial
–
A FD is trivial if and
only if the right-hand side (the dependent) is a subset of the left -hand side
(the determinant).
•
One set of FD can be
equivalent to another set
What are the functional dependencies
•
Semantics
–
An employee works for a
single department, but may have multiple skills
–
Skills have an ID, and a
name; an employee takes a test at a certain date to establish a skill level
•
Definition: Determinate
–
any attribute on which
some other attribute is dependent
Do you agree with these FD?
•
empIDà {empName, empPhone, deptName}
–
What about empPhoneà {empID}
•
deptNameà{deptPhone, deptMgr}
•
skillIDà{skillName}
–
What about skillNameà {skillID} ?
•
empID,
skillIDà{skillDate, skillLevel}
•
If you had a
different set of FD, were they equivalent?
How functional dependencies are used
•
FD are used as
information about the semantics of the relational schema
•
This, in turn, is
used to determine the normal form of a relation
• Normal forms are used to analyze how “good” a design
is
•
To determine the
normal form three things are needed
–
1) a
set of FDs for each relation
–
2) a
designated primary key
–
3) tests or conditions
for each normal form
•
So, next we look at the test or conditions for the
normal forms
Normalization overview
•
Originally, E.F. Codd proposed three normal forms, called first, second, and
third normal forms
–
1NF, 2NF and
3NF are based on the functional
dependencies among the attributes of a relation
–
Later a stronger
definition of 3NF was proposed by Boyce and Codd and
is known as Boyce-Codd normal form.
–
Later, 4th
and 5th normal forms were proposed, based on other properties
•
In the
normalization process, you start with a universal relation (one table with all
the attributes) and functional dependencies
•
Then you apply
the normal form restrictions, and decompose the tables as they specify.
Normalization of data
•
This is a process of analyzing
the relation schemas based on functional dependencies (FD) and PK to
achieve two desirable properties
–
1) Minimizing data
redundancy
–
2) Minimizing insertion,
deletion and update anomalies
Using normal forms
•
When a test fails,
the relation violating that test must be decomposed
•
All these normal forms
were based on FD among the attributes of a relation
•
Disclaimers to use of
normal forms:
–
Normal forms when
considered by themselves do not guarantee good db
design.
–
Sometimes the best
design may not be the highest normal form, for performance reasons.
Review of terminology of keys
•
SUPERKEY - any set of attributes that make a tuple unique
•
KEY - a minimum superkey
•
CANDIDATE KEY - if a relation has more than one key, each is a
candidate key
•
PRIMARY KEY - an arbitrary candidate key
•
SECONDARY KEY - a candidate key that is not the PK
•
PRIME
ATTRIBUTE - a member of any key
• NONPRIME or NONKEY ATTRIBUTE - not a member of any candidate key
First Normal Form
•
1NF is now generally considered to be part of the formal
definition of a relation. It states that:
– the domains of attributes must include only atomic
values
–
the value of any attribute in a tuple must be a single
value from the domain of that attribute
–
all attributes are dependent on the PK
•
Look at this schema for
departments:
Normalizing the relation into 1NF
•
The problem with
this table is that department location is multivalued.
•
What are the
possible solutions?
–
Have the primary
key be composite {dnumber, dlocation}
•
Problems with this?
–
If you know the
possible number of locations, make attributes for three locations
•
Problem with this?
–
Make a new table
with dname and dlocation
•
This is the preferred solution
Second Normal Form
•
The second normal
form is based on the concept of a full functional dependency.
–
If a dependency
is not a full functional dependency, it is called a partial dependency
–
This can occur
only if the determinate is composite
•
A functional
dependency X -> Y is a full functional dependency if removal of any
attribute from X means that the dependency does not hold any more
•
A functional
dependency is a partial dependency if there is some attribute A in X that can be removed from X and the dependency still
holds.
Definition for 2NF
• A relation is in 2NF if:
every nonkey attribute is fully functionally
dependent on the primary key.
•
If a relation is
not in 2NF, it can be normalized into a number of 2NF relations
–
This involves
decomposing the table
•
Problem: Normalize the EMP_PROJ table
•
Procedure:
–
Figure out the
dependencies, noting the determinates
•
A determinate is any attribute(s) on which some
other attribute(s) are dependent
–
Put each
determinate in a table by itself, and
–
Include in each
table the attributes that are dependent on that determinate
EMP_PROJ
Third Normal Form
•
Look at the ENP_DEPT
relation
•
Is it in 1NF? In 2NF?
•
Is there still a
problem? What is the problem?
EMP_DEPT
The problem with the EMP_DEPT table
•
DNAME depends on DNUM,
not the primary key.
•
This is called a transitive
dependency.
•
A functional dependency
X-> Y is a transitive dependency if
–
there is a set of attributes Z of the relation and both X
-> Z and Z ->Y hold
Definition for 3NF
• A relation is in 3NF if every nonkey attribute is:
–
Fully functionally dependent on the primary key (i.e.
in 2NF).
–
Nontransitively dependent on
the primary key.
•
We can normalize EMP_DEPT
by decomposing it into two 3NF relations.
•
Intuitively, we
see that the two results represent independent entity facts.
•
A natural join
will recover the original relation.
Boyce-Codd
Normal Form
Identify FD of this relation
•
Semantics
–
Lot#’s are unique
only within each county
–
Property_ID#’s are
unique across counties for the entire state
–
The tax rate is
fixed for a given county(does not vary lot by lot)
–
The price of a
lot is determined by its area regardless of which county it is in (assume that
this is the price for the lot for tax purposes
•
What is the
PK? Normalize it into 3NF.
Generalizing the definition of 2NF and
3NF
–
To assure we do not have
update anomalies we need to extend this definition to include all candidate key
rather than just PK
•
For 2NF we disallowed partial dependencies on
any key
•
For 3NF we disallowed transitive dependencies on
any key
• Simpler definition for 3NF: Every nonprime
attribute must be
– fully functionally dependent on every key
– nontransitively dependent on every key
Boyce-Codd
Normal Form
•
This is a simpler, yet
stricter form of 3NF.
•
There are a few cases
where a DB is in 3NF and not in BCNF.
•
Example: define another FD on the Lots DB
–
All the lots in the DB come from only two
counties
–
In one county, all lots
are >= one acre
–
In the other county, all
lots are < one acre
– The FD is area -> county_name
–
Is this still in 3NF
Definition of BDNF
•
A relation is in BCNF if and only if every determinate is a candidate
key
•
To be in 3NF but not in
BCNF, must have a composite candidate key, one of whose members is
determined by a non prime attribute
•
Any relation not in BCNF
can be decomposed into an equivalent set of BCNF relations
Summary
•
In general, it is best
to have a relational schema in BCNF
•
If that is not possible,
3NF will do
•
2NF and 1NF are not
considered good relation schema designs.
–
They allow too much data
redundancy which leads to update anomalies
Summary of Normal forms
• A relation is in 1NF the domains of attributes must
include only atomic values
• A relation is in 2NF if every nonkey
attribute is fully functionally dependent on a candidate key.
• A relation is in 3NF if every nonkey
attribute is in 2NF and nontransitively dependent on a candidate key
• A relation is in BCNF if and only if every determinate is a candidate
key