Functional Dependencies
and
Normalization

Chapter 14

Designing a database

•      So far, we have designed our base tables from a E-R diagram or by common sense

•      We still need some formal measure of why one group of attributes is a better base table than another group.

•      This chapter discusses some of the theory of what a “good” design is

Gathering the data

•      Identify the data requirements of the users.

–   This sounds simple, and it CAN be, but usually isn't because users rarely can clearly describe what they need.

–   In addition, the data processing staff, who are quite knowledgeable how the system functions, often have tunnel vision, which inhibits creative thought.

Normalization helps make this process objective

•      In many respects, normalization is simply applied common sense.

•      It is a technique for reviewing the entity/attribute list to insure that attributes are stored "where they belong"

•      We did not have any formal measure of why one grouping of attributes into a relation may be better than another.

•      We will be talking about the logical level or second level of the architecture. i.e designing base tables.

Informal discussion of some of the criteria for good and bad relational schema.

•      1. The meaning (semantics) of attributes

•      2. Redundant Information and Update Anomalies

•      3. Avoiding NULL values

•      4. Avoiding generation of spurious tuples

1. The meaning (semantics) of attributes

•      When we group attributes to form a relation, we assume that a certain meaning is associated with the attributes.

•      This meaning helps us understand how the attributes in a tuple are related to each other.

•      In general, the easier it is to explain the semantics of the relation, the better the relation schema design.

 

•      GUIDELINE 1:  Design a relation schema in such a way that it is easy to explain its meaning.

Other design possibilities

•      (or why some organizations do not work as well as others)

•      What is the semantics of these relations?

EMP_DEPT

 

 

EMP_PROJ

 

 

 


Why are these poor designs?

•      These relations also have clear meanings.

•      There is nothing logically wrong with these two relations,

–   but they mix attributes from distinct real-world entities.

•      They may be used as views, but they cause several problems when used as base tables.

What are the problems?

•      They take up more storage space because of data redundancy.

•      Another serious problem with using relations like this as base tables is the problem of update anomalies.

–    anomalies are the result of a poorly designed database.

 

There are three types of Update Anomalies

•      1. Insertion anomalies

–   Must be careful that DNAME & DMGRSSN agree with DNUM

–   Difficult to insert a new department that has no employees yet

•      2. Deletion anomalies

–   If we delete the last employee of a dept, we eliminate the information on that department

•      3. Modification anomalies

–   If dept 4 gets a new manager,  then all the employees who work in dept 4's tuples must be updated.

 

2. Redundant Information and Update Anomalies

•      GUIDELINE 2:  Design the base tables so no insertion, deletion, or modification anomalies occur in the relations.

The second guideline is consistent with and in a way a restatement of the first.

•      We begin to see a need for a more formal approach to help us evaluate whether our design meets these guidelines.

 

3. Avoiding NULL values

•      In some base tables we may group many attributes together into a "fat" relation

•      If many of the attributes do not apply to all tuples in the relation, we end up with many NULL's in the relation.

•      This can be wasteful of storage space, and may also lead to problems with understanding of the meaning of the attributes.

•      NULL's can have multiple interpretations

–    The attribute does not apply to this tuple

–    The attribute value for this tuple is unknown

–    The value is known but absent (has not been recorded yet)

 

Avoiding NULL values

•      GUIDELINE 3:  Avoid placing attributes in a base table whose values may be NULL, as far a possible.

•      For example, if only 10% of employees have individual offices, do not put an attribute OFF_NUM in the EMPLOYEE relation.

•      Solution: Create a new relation EMP_OFF that includes only tuples for employees who have offices.

 

Avoid generation of spurious tuples

•      Look at the relations EMP_LOC and EMP_PROJ1

•      A tuple in EMP_LOC means that an employee works on some project at plocation

•      A tuple in EMP_PROJ1 means that an employee with SSN works on project for hours/week, the name and location of the project are also kept

 

Using EMP_LOC and EMP_PROJ1 as base tables

•      This is a very bad schema design

•      If we do a natural join of these tables having as join conditions:

–    EMP_LOC.plocation = EMP_PROJ1.plocation

•      We get many spurious tuples

•      A spurious tuple has wrong information; information that is not valid

•      Why is this happening?

–    PLOCATION is not a PK in either table, so it cannot be a FK

 

Functional Dependency

•      This is a more formal, more objective design criteria.

•      Functional dependency is the single most important concept in relational database design.

•      A FD is a property of the meaning or semantics of the attributes of a relation.

–    Before I give a formal definition, lets look at dependency informally first.

–    Figure out which attributes are dependent on other attributes.

 

Formal Definition of Functional Dependency

•      Given a relation R and attributes X and Y,

•      we say Y is functionally dependent on X if and only if

–   for each X value there is precisely one Y value associated with it.

–   i.e. X functionally determines Y

•      this is written like this: X -> Y

•      What are the functional Dependencies of the EMP_DEPT and the EMP_PROJ relations

 

The FD of EMP_DEPT and the EMP_PROJ relations

         SSN -> {ename, bdate, address, dnumber}

   dnumber -> {dname, dmgrssn}

 

   {SSN, pnumber} ->{hours}

                        SSN -> {ename}

                 pnumber -> {pname, plocation}

 

Full functional dependency

•      If the X is composite, then there is a possibility that Y is not fully functionally dependent on X.

•      To be fully functionally dependent, Y must not be dependent on any proper subset of X.

•      {SSN,PNUMBER} -> ENAME is not fully functionally dependent because ENAME is dependent on SSN, a subset of {SSN, PNUMBER}.

 

Property of functional dependencies

•      A FD is a property of the relation schema, not of a particular extension or instance of the relation

•      It is actually a constraint on the schema

•      The DB designer uses their understanding of the meaning of the attributes to specify the functional dependency

•      These FD must hold for all instances of the DB at all times

 

The set of FD on a relation schema

•      Let the letter F denotes the set of FDs on a relation

•      Example for the EMP_DEPT relation

  F = {SSN -> {ename, bdate, address, dnumber

        dnumber -> {dname, dmgrssn}}

•      These are the FD that are semantically obvious

•      Are they the only FD?

–    There are many other FD that can be inferred or deduced from the semantic obvious ones

•      Example

–    SSN-> {dname, dmgrssn}  or even
SSN -> SSN,  or dnumber -> dname

 

Inferred FDs

•      Suppose F is the set of functional dependencies that are specified on relational schema R

–   Typically, the schema designer specifies the FD that are semantically obvious

–   But there are other depenencies that can be inferred or deduced from the FDs in F

•      The closure F+ of F is the set of all functional dependencies that can be inferred from F

 

Properties of functional dependencies

•      The inference rules are a set of rules that can be used to infer all the possible FDs from F

•      Some of these FD are trivial

–   A FD is trivial if and only if the right-hand side (the dependent) is a subset of the left -hand side (the determinant).

•      One set of FD can be equivalent to another set

 

What are the functional dependencies

 

 

 

 


•      Semantics

–   An employee works for a single department, but may have multiple skills

–   Skills have an ID, and a name; an employee takes a test at a certain date to establish a skill level

•      Definition: Determinate

–    any attribute on which some other attribute is dependent

 

Do you agree with these FD?

•      empIDà {empName, empPhone, deptName}

–    What about empPhoneà {empID}

•      deptNameà{deptPhone, deptMgr}

•      skillIDà{skillName}

–    What about skillNameà {skillID} ?

•      empID, skillIDà{skillDate, skillLevel}

•      If you had a different set of FD, were they equivalent?

 

 

How functional dependencies are used

•      FD are used as information about the semantics of the relational schema

•      This, in turn, is used to determine the normal form of a relation

•      Normal forms are used to analyze how “good” a design is

•      To determine the normal form three things are needed

–   1) a set of FDs for each relation

–   2) a designated primary key

–   3)  tests or conditions for each normal form

•   So, next we look at the test or conditions for the normal forms

 

Normalization overview

•      Originally, E.F. Codd proposed three normal forms, called first, second, and third normal forms

–    1NF, 2NF and 3NF are based on the functional dependencies among the attributes of a relation

–    Later a stronger definition of 3NF was proposed by Boyce and Codd and is known as Boyce-Codd normal form.

–    Later, 4th and 5th normal forms were proposed, based on other properties

•      In the normalization process, you start with a universal relation (one table with all the attributes) and functional dependencies

•      Then you apply the normal form restrictions, and decompose the tables as they specify.

 

Normalization of data

•      This is a process of analyzing the relation schemas based on functional dependencies (FD) and PK to achieve two desirable properties

–   1) Minimizing data redundancy

–   2) Minimizing insertion, deletion and update anomalies

 

Using normal forms

•      When a test fails, the relation violating that test must be decomposed

•      All these normal forms were based on FD among the attributes of a relation

•      Disclaimers to use of normal forms:

–   Normal forms when considered by themselves do not guarantee good db design.

–   Sometimes the best design may not be the highest normal form, for performance reasons.

 

Review of terminology of keys

•      SUPERKEY - any set of attributes that make a tuple unique

•      KEY - a minimum superkey

•      CANDIDATE KEY - if a relation has more than one key, each is a candidate key

•      PRIMARY KEY - an arbitrary candidate key

•      SECONDARY KEY - a candidate key that is not the PK

•      PRIME ATTRIBUTE - a member of any key

•      NONPRIME or NONKEY ATTRIBUTE - not a member of any candidate key

 

First Normal Form

•      1NF is now generally considered to be part of the formal definition of a relation. It states that:

–   the domains of attributes must include only atomic values

–   the value of any attribute in a tuple must be a single value from the domain of that attribute

–   all attributes are dependent on the PK

•      Look at this schema for departments:

 

 

 


Normalizing the relation into 1NF

•      The problem with this table is that department location is multivalued.

•      What are the possible solutions?

–    Have the primary key be composite  {dnumber, dlocation}

•    Problems with this?

–    If you know the possible number of locations, make attributes for three locations

•    Problem with this?

–    Make a new table with dname and dlocation

•    This is the preferred solution

 

Second Normal Form

•      The second normal form is based on the concept of a full functional dependency.

–    If a dependency is not a full functional dependency, it is called a partial dependency

–    This can occur only if the determinate is composite

•      A functional dependency X -> Y is a full functional dependency if removal of any attribute from X means that the dependency does not hold any more

•      A functional dependency is a partial dependency if there is some attribute A in X that can be removed from X and the dependency still holds.

 

Definition for 2NF

•      A relation is in 2NF if:
every nonkey attribute is fully functionally dependent on the primary key.

•      If a relation is not in 2NF, it can be normalized into a number of 2NF relations

–    This involves decomposing the table

•      Problem:  Normalize the EMP_PROJ table

•      Procedure:

–    Figure out the dependencies, noting the determinates

•    A determinate is any attribute(s) on which some other attribute(s) are dependent

–    Put each determinate in a table by itself, and

–    Include in each table the attributes that are dependent on that determinate

EMP_PROJ

 

 

 


Third Normal Form

•      Look at the ENP_DEPT relation

•      Is it in 1NF?  In 2NF?

•      Is there still a problem?  What is the problem?

 

EMP_DEPT

 

 

 

 

The problem with the EMP_DEPT table

•      DNAME depends on DNUM, not the primary key.

•      This is called a transitive dependency.

•      A functional dependency X-> Y is a transitive dependency if

–   there is a set of attributes Z of the relation and both X -> Z and Z ->Y hold

 

Definition for 3NF

•      A relation is in 3NF if every nonkey attribute is:

–   Fully functionally dependent on the primary key (i.e. in 2NF).

–   Nontransitively dependent on the primary key.

•      We can normalize EMP_DEPT by decomposing it into two 3NF relations.

•      Intuitively, we see that the two results represent independent entity facts.

•      A natural join will recover the original relation.

 

Identify FD of this relation

 

 

 


•      Semantics

–     Lot#’s are unique only within each county

–     Property_ID#’s are unique across counties for the entire state

–    The tax rate is fixed for a given county(does not vary lot by lot)

–    The price of a lot is determined by its area regardless of which county it is in (assume that this is the price for the lot for tax purposes

•      What is the PK?   Normalize it into 3NF.

 

Generalizing the definition of 2NF and 3NF

–   To assure we do not have update anomalies we need to extend this definition to include all candidate key rather than just PK

•   For 2NF we disallowed partial dependencies on any key

•   For 3NF we disallowed transitive dependencies on any key

•      Simpler definition for 3NF: Every nonprime attribute must be

–   fully functionally dependent on every key

–   nontransitively dependent on every key

 

 

Boyce-Codd Normal Form

•      This is a simpler, yet stricter form of 3NF.

•      There are a few cases where a DB is in 3NF and not in BCNF.

•      Example:  define another FD on the Lots DB

–   All the lots in the DB  come from only two counties

–   In one county, all lots are >= one acre

–   In the other county, all lots are < one acre

–   The FD is area -> county_name

–   Is this still in 3NF

 

Definition of BDNF

•      A relation is in BCNF if and only if every determinate is a candidate key

•      To be in 3NF but not in BCNF, must have a composite candidate key, one of whose members is determined by a non prime attribute

•      Any relation not in BCNF can be decomposed into an equivalent set of BCNF relations

 

Summary

•      In general, it is best to have a relational schema in BCNF

•      If that is not possible, 3NF will do

•      2NF and 1NF are not considered good relation schema designs. 

–   They allow too much data redundancy which leads to update anomalies

 

Summary of Normal forms

•      A relation is in 1NF the domains of attributes must include only atomic values

•      A relation is in 2NF if every nonkey attribute is fully functionally dependent on a candidate key.

 

•      A relation is in 3NF if every nonkey attribute is in 2NF and  nontransitively dependent on a candidate key

 

•      A relation is in BCNF if and only if every determinate is a candidate key