Functional Dependencies
and
Normalization

Chapter 10

Designing a database

      So far, we have designed our base tables from a E-R diagram or by common sense

      We still need some formal measure of why one group of attributes is a better base table than another group.

      This chapter discusses some of the theory of what a “good” design is

 

Gathering the data

      Identify the data requirements of the users.

   This sounds simple, and it CAN be, but usually isn't because users rarely can clearly describe what they need.

   In addition, the data processing staff, who are quite knowledgeable how the system functions, often have tunnel vision, which inhibits creative thought.

Normalization helps make this process objective

      In many respects, normalization is simply applied common sense.

      It is a technique for reviewing the entity/attribute list to insure that attributes are stored "where they belong"

      We did not have any formal measure of why one grouping of attributes into a relation may be better than another.

      We will be talking about the logical level or second level of the architecture. i.e designing base tables.

Informal discussion of some of the criteria for good and bad relational schema.

      1. The meaning (semantics) of attributes

      2. Redundant Information and Update Anomalies

      3. Avoiding NULL values

      4. Avoiding generation of spurious tuples

 

1. The meaning (semantics) of attributes

      When we group attributes to form a relation, we assume that a certain meaning is associated with the attributes.

      This meaning helps us understand how the attributes in a tuple are related to each other.

      In general, the easier it is to explain the semantics of the relation, the better the relation schema design.

      GUIDELINE 1:  Design a relation schema so that it has attributes of only one entity.

 

2. Redundant Information and Update Anomalies

 (or why some organizations do not work as well as others)

      What is the semantics of these relations?

 

EMP_DEPT

 

 

EMP_PROJ

 

 

 


Why are these poor designs?

      These relations also have clear meanings.

      There is nothing logically wrong with these two relations,

   but they mix attributes from distinct real-world entities.

      They may be used as views, but they cause several problems when used as base tables.

 

What are the problems?

      They take up more storage space because of data redundancy.

      Another serious problem with using relations like this as base tables is the problem of update anomalies.

    anomalies are the result of a poorly designed database.

 

There are three types of Update Anomalies

      1. Insertion anomalies

   Must be careful that DNAME & DMGRSSN agree with DNUM

   Difficult to insert a new department that has no employees yet

      2. Deletion anomalies

   If we delete the last employee of a dept, we eliminate the information on that department

      3. Modification anomalies

   If dept 4 gets a new manager,  then all the employees who work in dept 4's tuples must be updated.

 

      GUIDELINE 2:  Design the base tables so no insertion, deletion, or modification anomalies occur in the relations.

The second guideline is consistent with and in a way a restatement of the first.

      We begin to see a need for a more formal approach to help us evaluate whether our design meets these guidelines.

 

3. Avoiding NULL values

      In some base tables we may group many attributes together into a "fat" relation

      If many of the attributes do not apply to all tuples in the relation, we end up with many NULL's in the relation.

      This can be wasteful of storage space, and may also lead to problems with understanding of the meaning of the attributes.

      NULL's can have multiple interpretations

    The attribute does not apply to this tuple

    The attribute value for this tuple is unknown

    The value is known but absent (has not been recorded yet)

 

      GUIDELINE 3:  Avoid placing attributes in a base table whose values may be NULL, as far a possible.

      For example, if only 10% of employees have individual offices, do not put an attribute OFF_NUM in the EMPLOYEE relation.

      Solution: Create a new relation EMP_OFF that includes only tuples for employees who have offices.

 

4. Avoid generation of spurious tuples

      Look at the relations EMP_LOC and EMP_PROJ1

      A tuple in EMP_LOC means that an employee works on some project at plocation

      A tuple in EMP_PROJ1 means that an employee with SSN works on project for hours/week, the name and location of the project are also kept

 

Using EMP_LOC and EMP_PROJ1 as base tables

      This is a very bad schema design

      If we do a natural join of these tables having as join conditions:

    EMP_LOC.plocation = EMP_PROJ1.plocation

      We get many spurious tuples

      A spurious tuple has wrong information; information that is not valid

      Why is this happening?

    PLOCATION is not a PK in either table, so it cannot be a FK

 

Functional Dependency

      This is a more formal, more objective design criteria.

      Functional dependency is the single most important concept in relational database design.

      A FD is a property of the meaning or semantics of the attributes of a relation.

    Before I give a formal definition, lets look at dependency informally first.

    Figure out which attributes are dependent on other attributes.

 

Formal Definition of Functional Dependency

      Given a relation R and attributes X and Y,

      we say Y is functionally dependent on X if and only if

   for each X value there is precisely one Y value associated with it.

   i.e. X functionally determines Y

      this is written like this: X -> Y

      What are the functional Dependencies of the EMP_DEPT and the EMP_PROJ relations

 

The FD of EMP_DEPT and the EMP_PROJ relations

         SSN -> {ename, bdate, address, dnumber}

   dnumber -> {dname, dmgrssn}

 

   {SSN, pnumber} ->{hours}

                        SSN -> {ename}

                 pnumber -> {pname, plocation}

 

Full functional dependency

      If the X is composite, then there is a possibility that Y is not fully functionally dependent on X.

      To be fully functionally dependent, Y must not be dependent on any proper subset of X.

      {SSN,PNUMBER} -> ENAME is not fully functionally dependent because ENAME is dependent on SSN, a subset of {SSN, PNUMBER}.

 

 

The set of FD on a relation schema

      Let the letter F denotes the set of FDs on a relation

      Example for the EMP_DEPT relation

  F = {SSN -> {ename, bdate, address, dnumber

        dnumber -> {dname, dmgrssn}}

      These are the FD that are semantically obvious

      Are they the only FD?

    There are many other FD that can be inferred or deduced from the semantic obvious ones

      Example

    SSN-> {dname, dmgrssn}  or even
SSN -> SSN,  or dnumber -> dname

 

Inferred FDs

      Suppose F is the set of functional dependencies that are specified on relational schema R

   Typically, the schema designer specifies the FD that are semantically obvious

   But there are other depenencies that can be inferred or deduced from the FDs in F

      The closure F+ of F is the set of all functional dependencies that can be inferred from F

 

Inference rules of functional dependencies

      The inference rules are a set of rules that can be used to infer all the possible FDs from F

      Some of these FD are trivial

   A FD is trivial if and only if the right-hand side (the dependent) is a subset of the left -hand side (the determinant).

      One set of FD can be equivalent to another set

 

What are the functional dependencies

 

 

 

 


      Semantics

   An employee works for a single department, but may have multiple skills

   Skills have an ID, and a name; an employee takes a test at a certain date to establish a skill level

      Definition: Determinate

    any attribute on which some other attribute is dependent

 

Do you agree with these FD?

      empIDà {empName, empPhone, deptName}

    What about empPhoneà {empID}

      deptNameà{deptPhone, deptMgr}

      skillIDà{skillName}

    What about skillNameà {skillID} ?

      empID, skillIDà{skillDate, skillLevel}

      If you had a different set of FD, were they equivalent?

 

 

How functional dependencies are used

      FD are used as information about the semantics of the relational schema

      This, in turn, is used to determine the normal form of a relation

      Normal forms are used to analyze how “good” a design is

      To determine the normal form three things are needed

   1) a set of FDs for each relation

   2) a designated primary key

   3)  tests or conditions for each normal form

   So, next we look at the test or conditions for the normal forms

 

Normalization overview

      Originally, E.F. Codd proposed three normal forms, called first, second, and third normal forms

    1NF, 2NF and 3NF are based on the functional dependencies among the attributes of a relation

    Later a stronger definition of 3NF was proposed by Boyce and Codd and is known as Boyce-Codd normal form.

    Later, 4th and 5th normal forms were proposed, based on other properties

      In the normalization process, you start with a universal relation (one table with all the attributes) and functional dependencies

      Then you apply the normal form restrictions, and decompose the tables as they specify.

 

Normalization of data

      This is a process of analyzing the relation schemas based on functional dependencies (FD) and PK to achieve two desirable properties

   1) Minimizing data redundancy

   2) Minimizing insertion, deletion and update anomalies

 

Using normal forms

      When a test fails, the relation violating that test must be decomposed

      All these normal forms were based on FD among the attributes of a relation

      Disclaimers to use of normal forms:

   Normal forms when considered by themselves do not guarantee good db design.

   Sometimes the best design may not be the highest normal form, for performance reasons.

 

Review of terminology of keys

      SUPERKEY - any set of attributes that make a tuple unique

      KEY - a minimum superkey

      CANDIDATE KEY - if a relation has more than one key, each is a candidate key

      PRIMARY KEY - an arbitrary candidate key

      SECONDARY KEY - a candidate key that is not the PK

      PRIME ATTRIBUTE - a member of any key

      NONPRIME or NONKEY ATTRIBUTE - not a member of any candidate key

 

 

 

 

First Normal Form

      1NF is now generally considered to be part of the formal definition of a relation. It states that:

   the domains of attributes must include only atomic values

   the value of any attribute in a tuple must be a single value from the domain of that attribute

   all attributes are dependent on the PK

      Look at this schema for departments:

 

 

 


Normalizing the relation into 1NF

      The problem with this table is that department location is multivalued.

      What are the possible solutions?

    Have the primary key be composite  {dnumber, dlocation}

    Problems with this?

    If you know the possible number of locations, make attributes for three locations

    Problem with this?

    Make a new table with dname and dlocation

    This is the preferred solution

 

 

 

 

Second Normal Form

      The second normal form is based on the concept of a full functional dependency.

    If a dependency is not a full functional dependency, it is called a partial dependency

    This can occur only if the determinate is composite

      A functional dependency X -> Y is a full functional dependency if removal of any attribute from X means that the dependency does not hold any more

      A functional dependency is a partial dependency if there is some attribute A in X that can be removed from X and the dependency still holds.

 

Definition for 2NF

      A relation is in 2NF if:
every nonkey attribute is fully functionally dependent on the primary key.

      If a relation is not in 2NF, it can be normalized into a number of 2NF relations

    This involves decomposing the table

      Problem:  Normalize the EMP_PROJ table

      Procedure:

    Figure out the dependencies, noting the determinates

    A determinate is any attribute(s) on which some other attribute(s) are dependent

    Put each determinate in a table by itself, and

    Include in each table the attributes that are dependent on that determinate

EMP_PROJ

 

 

 

 

 

 


Third Normal Form

      Look at the ENP_DEPT relation

      Is it in 1NF?  In 2NF?

      Is there still a problem?  What is the problem?

 

EMP_DEPT

 

 

 

 

The problem with the EMP_DEPT table

      DNAME depends on DNUM, not the primary key.

      This is called a transitive dependency.

      A functional dependency X-> Y is a transitive dependency if

   there is a set of attributes Z of the relation and both X -> Z and Z ->Y hold

 

Definition for 3NF

      A relation is in 3NF if every nonkey attribute is:

   Fully functionally dependent on the primary key (i.e. in 2NF).

   Nontransitively dependent on the primary key.

      We can normalize EMP_DEPT by decomposing it into two 3NF relations.

      Intuitively, we see that the two results represent independent entity facts.

      A natural join will recover the original relation.

 

 

 

Boyce-Codd Normal Form

 

Identify FD of this relation

 

 

 


      Semantics

     Lot#’s are unique only within each county

     Property_ID#’s are unique across counties for the entire state

    The tax rate is fixed for a given county(does not vary lot by lot)

    The price of a lot is determined by its area regardless of which county it is in (assume that this is the price for the lot for tax purposes

      What is the PK?   Normalize it into 3NF.

 

Generalizing the definition of 2NF and 3NF

   To assure we do not have update anomalies we need to extend this definition to include all candidate key rather than just PK

   For 2NF we disallowed partial dependencies on any key

   For 3NF we disallowed transitive dependencies on any key

      Simpler definition for 3NF: Every nonprime attribute must be

   fully functionally dependent on every key

   nontransitively dependent on every key

 

 

Boyce-Codd Normal Form

      This is a simpler, yet stricter form of 3NF.

      There are a few cases where a DB is in 3NF and not in BCNF.

      Example:  define another FD on the Lots DB

   All the lots in the DB  come from only two counties

   In one county, all lots are >= one acre

   In the other county, all lots are < one acre

   The FD is area -> county_name

   Is this still in 3NF

 

Definition of BDNF

      A relation is in BCNF if and only if every determinate is a candidate key

      To be in 3NF but not in BCNF, must have a composite candidate key, one of whose members is determined by a non prime attribute

      Any relation not in BCNF can be decomposed into an equivalent set of BCNF relations

 

Summary

      In general, it is best to have a relational schema in BCNF

      If that is not possible, 3NF will do

      2NF and 1NF are not considered good relation schema designs. 

   They allow too much data redundancy which leads to update anomalies

 

Summary of Normal forms

      A relation is in 1NF the domains of attributes must include only atomic values

      A relation is in 2NF if every nonkey attribute is fully functionally dependent on a candidate key.

 

      A relation is in 3NF if every nonkey attribute is in 2NF and  nontransitively dependent on a candidate key

 

      A relation is in BCNF if and only if every determinate is a candidate key