Functional
Dependencies
and
Normalization
Chapter 14
Designing a database
• So far, we have designed our base tables from a E-R
diagram or by common sense
• We still need some formal measure of why one group of attributes
is a better base table than another group.
• This chapter discusses some of the theory of what a
“good” design is
Gathering the data
•
Identify the
data requirements of the users.
– This sounds simple, and it CAN be, but usually isn't
because users rarely can clearly describe what they need.
– In addition, the data processing staff,
who are quite knowledgeable how the system functions, often have tunnel
vision, which inhibits creative thought.
Normalization helps make this process
objective
•
In many respects,
normalization is simply applied common sense.
•
It is a technique
for reviewing the entity/attribute list to insure that attributes are
stored "where they belong"
• We did not have any formal measure of why one grouping
of attributes into a relation may be better than another.
•
We will be
talking about the logical level or second level of the architecture. i.e designing base tables.
Informal discussion of
some of the criteria for good and bad relational schema.
•
1. The meaning
(semantics) of attributes
•
2. Redundant
Information and Update Anomalies
•
3. Avoiding
NULL values
•
4. Avoiding
generation of spurious tuples
1. The meaning (semantics) of attributes
• When we group attributes to form a relation, we assume
that a certain meaning is associated with the attributes.
• This meaning helps us understand how the attributes in
a tuple are related to each other.
• In general, the easier it is to explain the semantics
of the relation, the better the relation schema design.
• GUIDELINE 1: Design a relation schema in such a way that
it is easy to explain its meaning.
Other design possibilities
•
(or why some organizations do not work as well as others)
• What is the semantics of these relations?
EMP_DEPT
EMP_PROJ
Why are these poor designs?
• These relations also have clear meanings.
• There is nothing logically wrong with these two
relations,
– but they mix attributes from distinct real-world
entities.
•
They may be used
as views, but they cause several problems when used as base tables.
What are the problems?
• They take up more storage space because of data
redundancy.
• Another serious problem with using relations like this
as base tables is the problem of update anomalies.
–
anomalies are the
result of a poorly designed database.
There are three types of Update Anomalies
• 1. Insertion
anomalies
– Must be careful that DNAME & DMGRSSN agree with
DNUM
– Difficult to insert a new department that has no
employees yet
•
2. Deletion
anomalies
– If we delete the last employee of a dept, we eliminate
the information on that department
•
3. Modification
anomalies
– If dept
4 gets a new manager, then all the
employees who work in dept 4's tuples must be updated.
2. Redundant Information and Update
Anomalies
•
GUIDELINE 2: Design the base
tables so no insertion, deletion, or modification anomalies occur in the
relations.
The second guideline is consistent
with and in a way a restatement of the first.
• We begin to see a need for a more formal approach to
help us evaluate whether our design meets these guidelines.
3. Avoiding NULL values
• In some base tables we may group many attributes
together into a "fat" relation
• If many of the attributes do not apply to all tuples
in the relation, we end up with many NULL's in the
relation.
• This can be wasteful of storage space, and may also
lead to problems with understanding of the meaning of the attributes.
• NULL's can have multiple interpretations
–
The attribute
does not apply to this tuple
–
The attribute
value for this tuple is unknown
– The value is known but absent (has not been recorded
yet)
Avoiding NULL values
• GUIDELINE 3: Avoid placing attributes in a base table
whose values may be NULL, as far a possible.
• For example, if only 10% of employees have individual offices,
do not put an attribute OFF_NUM in the EMPLOYEE relation.
• Solution: Create a new relation EMP_OFF that includes
only tuples for employees who have offices.
Avoid generation of spurious tuples
•
Look at the
relations EMP_LOC and EMP_PROJ1
• A tuple in EMP_LOC means that an employee works
on some project at plocation
• A tuple in EMP_PROJ1 means that an employee
with SSN works on project for hours/week, the name and location of the project
are also kept
Using
EMP_LOC and EMP_PROJ1 as base tables
• This is a very bad schema design
• If we do a natural join of these tables having as join
conditions:
–
EMP_LOC.plocation = EMP_PROJ1.plocation
• We get many spurious tuples
• A spurious tuple has wrong information; information
that is not valid
• Why is this happening?
–
PLOCATION is not
a PK in either table, so it cannot be a FK
Functional Dependency
• This is a more formal, more objective design criteria.
• Functional dependency is the single most important
concept in relational database design.
• A FD is a property of the meaning or semantics
of the attributes of a relation.
–
Before I give a
formal definition, lets look at dependency informally first.
–
Figure
out which attributes are dependent on other attributes.
Formal Definition of Functional
Dependency
• Given a relation R and attributes X and Y,
• we say Y is functionally dependent on X if and only if
– for each X value there is precisely one Y value
associated with it.
– i.e. X functionally determines Y
• this is written like this: X -> Y
• What are the functional Dependencies of the EMP_DEPT
and the EMP_PROJ relations
The FD of EMP_DEPT and the EMP_PROJ
relations
SSN -> {ename, bdate,
address, dnumber}
dnumber -> {dname,
dmgrssn}
{SSN, pnumber} ->{hours}
SSN -> {ename}
pnumber -> {pname, plocation}
Full functional dependency
• If the X is composite, then there is a
possibility that Y is not fully functionally dependent on X.
• To be fully functionally dependent, Y must not be
dependent on any proper subset of X.
• {SSN,PNUMBER} -> ENAME is not fully functionally dependent because ENAME is
dependent on SSN, a subset of {SSN, PNUMBER}.
Property of functional dependencies
• A FD is a property of the relation schema, not
of a particular extension or instance of the relation
• It is actually a constraint on the schema
• The DB designer uses their understanding of the
meaning of the attributes to specify the functional dependency
• These FD must hold for all instances of the DB
at all times
The set of FD on a relation schema
• Let the letter F denotes the set of FDs on a relation
• Example for the EMP_DEPT relation
F
= {SSN -> {ename,
bdate, address, dnumber
dnumber -> {dname, dmgrssn}}
• These are the FD that are semantically obvious
• Are they the only FD?
–
There are many
other FD that can be inferred or deduced from the semantic obvious ones
• Example
–
SSN-> {dname, dmgrssn} or even
SSN -> SSN, or dnumber
-> dname
Inferred FDs
•
Suppose F
is the set of functional dependencies that are specified on relational schema R
– Typically, the schema designer specifies the FD that
are semantically obvious
– But there are other depenencies
that can be inferred or deduced from the FDs in F
• The closure F+ of F is the set of all
functional dependencies that can be inferred from F
Properties of functional dependencies
• The inference rules are a set of rules that can be
used to infer all the possible FDs from F
• Some of these FD are trivial
– A FD is trivial if and only if the right-hand side (the
dependent) is a subset of the left -hand side (the determinant).
• One set of FD can be equivalent to another set
What
are the functional dependencies
• Semantics
– An employee works for a single department, but may
have multiple skills
– Skills have an ID, and a name; an employee takes a
test at a certain date to establish a skill level
• Definition: Determinate
– any attribute on which some other attribute is dependent
Do
you agree with these FD?
• empIDà {empName, empPhone, deptName}
–
What about empPhoneà {empID}
• deptNameà{deptPhone, deptMgr}
• skillIDà{skillName}
–
What about skillNameà {skillID} ?
• empID, skillIDà{skillDate, skillLevel}
• If you had a different set of FD, were they
equivalent?
How functional dependencies are used
• FD are used as information about the semantics
of the relational schema
• This, in turn, is used to determine the normal form
of a relation
•
Normal forms
are used to analyze how “good” a design is
• To determine the normal form three things are needed
– 1) a set of FDs for each relation
– 2) a designated primary key
– 3) tests or conditions for each normal form
•
So, next we look at the test or conditions for the
normal forms
Normalization overview
• Originally, E.F. Codd
proposed three normal forms, called first, second, and third normal forms
–
1NF, 2NF and
3NF are based on the functional
dependencies among the attributes of a relation
–
Later a stronger
definition of 3NF was proposed by Boyce and Codd and
is known as Boyce-Codd normal form.
–
Later, 4th
and 5th normal forms were proposed, based on other properties
• In the normalization process, you start with a
universal relation (one table with all the attributes) and functional
dependencies
• Then you apply the normal form restrictions, and
decompose the tables as they specify.
Normalization of data
• This is a process of analyzing the relation schemas
based on functional dependencies (FD) and PK to achieve two desirable
properties
– 1) Minimizing data redundancy
– 2) Minimizing insertion, deletion and update anomalies
Using normal forms
• When a test fails, the relation violating that
test must be decomposed
• All these normal forms were based on FD among the
attributes of a relation
• Disclaimers to use of normal forms:
– Normal forms when considered by themselves
do not guarantee good db design.
– Sometimes the best design may not be the highest
normal form, for performance reasons.
Review of terminology of keys
• SUPERKEY -
any set of attributes that make a tuple unique
• KEY - a
minimum superkey
• CANDIDATE KEY
- if a relation has more than one key, each is a candidate key
• PRIMARY KEY
- an arbitrary candidate key
• SECONDARY KEY
- a candidate key that is not the PK
• PRIME ATTRIBUTE
- a member of any key
•
NONPRIME or
NONKEY ATTRIBUTE - not a member of
any candidate key
First Normal Form
• 1NF is now
generally considered to be part of the formal definition of a relation. It
states that:
–
the domains of attributes must include only atomic
values
– the value of any attribute in a tuple must be a single
value from the domain of that attribute
– all attributes are dependent on the PK
• Look at this schema for departments:
Normalizing
the relation into 1NF
• The problem with this table is that department
location is multivalued.
• What are the possible solutions?
–
Have the primary
key be composite {dnumber, dlocation}
• Problems with this?
–
If you know the
possible number of locations, make attributes for three locations
• Problem with this?
–
Make a new table
with dname and dlocation
• This
is the preferred solution
Second Normal Form
• The second normal form is based on the concept of a full
functional dependency.
–
If a dependency
is not a full functional dependency, it is called a partial dependency
–
This can occur
only if the determinate is composite
• A functional dependency X -> Y is a full functional
dependency if removal of any attribute from X means that the dependency does
not hold any more
• A functional dependency is a partial dependency
if there is some attribute A in X that can be removed
from X and the dependency still holds.
Definition for 2NF
• A relation is in 2NF if:
every nonkey attribute is fully functionally
dependent on the primary key.
• If a relation is not in 2NF, it can be normalized
into a number of 2NF relations
–
This involves
decomposing the table
• Problem: Normalize
the EMP_PROJ table
• Procedure:
–
Figure out the
dependencies, noting the determinates
• A
determinate is any attribute(s) on which some other attribute(s) are dependent
–
Put each
determinate in a table by itself, and
–
Include in each
table the attributes that are dependent on that determinate
EMP_PROJ
Third Normal Form
• Look at the ENP_DEPT relation
• Is it in 1NF? In 2NF?
• Is there still a problem? What is the problem?
EMP_DEPT
The problem with the EMP_DEPT table
• DNAME depends on DNUM, not the primary key.
• This is called a transitive dependency.
• A functional dependency X-> Y is a transitive
dependency if
– there is a set of attributes Z of the relation and both X
-> Z and Z ->Y hold
Definition for 3NF
• A relation is in 3NF if every nonkey attribute is:
–
Fully functionally dependent on the primary key (i.e.
in 2NF).
–
Nontransitively dependent on
the primary key.
• We can normalize EMP_DEPT by decomposing it
into two 3NF relations.
• Intuitively, we see that the two results represent
independent entity facts.
• A natural join will recover the original relation.
Identify FD of this relation
• Semantics
–
Lot#’s are
unique only within each county
–
Property_ID#’s are
unique across counties for the entire state
–
The tax rate is
fixed for a given county(does not vary lot by lot)
–
The price of a
lot is determined by its area regardless of which county it is in (assume that
this is the price for the lot for tax purposes
• What is the PK?
Normalize it into 3NF.
Generalizing
the definition of 2NF and 3NF
– To assure we do not have update anomalies we need to
extend this definition to include all candidate key rather than just PK
•
For 2NF we disallowed partial dependencies on
any key
•
For 3NF we disallowed transitive dependencies on
any key
•
Simpler
definition for 3NF: Every nonprime attribute must be
–
fully functionally
dependent on every key
–
nontransitively dependent on every key
Boyce-Codd Normal Form
• This is a simpler, yet stricter form of 3NF.
• There are a few cases where a DB is in 3NF and not in
BCNF.
• Example: define
another FD on the Lots DB
– All the lots in the DB come from only two counties
– In one county, all lots are >= one acre
– In the other county, all lots are < one acre
–
The FD is area
-> county_name
– Is this still in 3NF
Definition
of BDNF
• A relation is in BCNF if and only if
every determinate is a candidate key
• To be in 3NF but not in BCNF, must have a composite
candidate key, one of whose members is determined by a non prime attribute
• Any relation not in BCNF can be decomposed into an
equivalent set of BCNF relations
Summary
• In general, it is best to have a relational schema in
BCNF
• If that is not possible, 3NF will do
• 2NF and 1NF are not considered good relation schema
designs.
– They allow too much data redundancy which leads to
update anomalies
Summary
of Normal forms
•
A relation is in 1NF the domains of attributes must include only atomic values
•
A relation is
in 2NF if every nonkey attribute is fully functionally dependent on a
candidate key.
•
A relation is
in 3NF if every nonkey attribute is in 2NF and nontransitively
dependent on a candidate key
•
A relation is
in BCNF if and
only if every determinate is a candidate key