Functional Dependencies
and
Normalization

Chapter 10

Designing a database

• So far, we have designed our base tables from a E-R diagram or by common sense

• We still need some formal measure of why one group of attributes is a better base table than another group.

• This chapter discusses some of the theory of what a “good” design is

Gathering the data

• Identify the data requirements of the users.

– This sounds simple, and it CAN be, but usually isn't because users rarely can clearly describe what they need.

– In addition, the data processing staff, who are quite knowledgeable how the system functions, often have tunnel vision, which inhibits creative thought.

Normalization helps make this process objective

• In many respects, normalization is simply applied common sense.

• It is a technique for reviewing the entity/attribute list to insure that attributes are stored "where they belong"

• We did not have any formal measure of why one grouping of attributes into a relation may be better than another.

• We will be talking about the logical level or second level of the architecture. i.e designing base tables.

Informal discussion of some of the criteria for good and bad relational schema.

• 1. The meaning (semantics) of attributes

• 2. Redundant Information and Update Anomalies

• 3. Avoiding NULL values

• 4. Avoiding generation of spurious tuples

1. The meaning (semantics) of attributes

• When we group attributes to form a relation, we assume that a certain meaning is associated with the attributes.

• This meaning helps us understand how the attributes in a tuple are related to each other.

• In general, the easier it is to explain the semantics of the relation, the better the relation schema design.

• GUIDELINE 1: Design a relation schema so that it has attributes of only one entity.

2. Redundant Information and Update Anomalies

(or why some organizations do not work as well as others)

• What is the semantics of these relations?

EMP_DEPT

EMP_PROJ

Why are these poor designs?

• These relations also have clear meanings.

• There is nothing logically wrong with these two relations,

– but they mix attributes from distinct real-world entities.

• They may be used as views, but they cause several problems when used as base tables.

What are the problems?

• They take up more storage space because of data redundancy.

• Another serious problem with using relations like this as base tables is the problem of update anomalies.

– anomalies are the result of a poorly designed database.

There are three types of Update Anomalies

• 1. Insertion anomalies

– Must be careful that DNAME & DMGRSSN agree with DNUM

– Difficult to insert a new department that has no employees yet

• 2. Deletion anomalies

– If we delete the last employee of a dept, we eliminate the information on that department

• 3. Modification anomalies

– If dept 4 gets a new manager, then all the employees who work in dept 4's tuples must be updated.

• GUIDELINE 2: Design the base tables so no insertion, deletion, or modification anomalies occur in the relations.

The second guideline is consistent with and in a way a restatement of the first.

• We begin to see a need for a more formal approach to help us evaluate whether our design meets these guidelines.

3. Avoiding NULL values

• In some base tables we may group many attributes together into a "fat" relation

• If many of the attributes do not apply to all tuples in the relation, we end up with many NULL's in the relation.

• This can be wasteful of storage space, and may also lead to problems with understanding of the meaning of the attributes.

• NULL's can have multiple interpretations

– The attribute does not apply to this tuple

– The attribute value for this tuple is unknown

– The value is known but absent (has not been recorded yet)

• GUIDELINE 3: Avoid placing attributes in a base table whose values may be NULL, as far a possible.

• For example, if only 10% of employees have individual offices, do not put an attribute OFF_NUM in the EMPLOYEE relation.

• Solution: Create a new relation EMP_OFF that includes only tuples for employees who have offices.

4. Avoid generation of spurious tuples

• Look at the relations EMP_LOC and EMP_PROJ1

• A tuple in EMP_LOC means that an employee works on some project at plocation

• A tuple in EMP_PROJ1 means that an employee with SSN works on project for hours/week, the name and location of the project are also kept

Using EMP_LOC and EMP_PROJ1 as base tables

• This is a very bad schema design

• If we do a natural join of these tables having as join conditions:

– EMP_LOC.plocation = EMP_PROJ1.plocation

• We get many spurious tuples

• A spurious tuple has wrong information; information that is not valid

• Why is this happening?

– PLOCATION is not a PK in either table, so it cannot be a FK

Functional Dependency

• This is a more formal, more objective design criteria.

• Functional dependency is the single most important concept in relational database design.

• A FD is a property of the meaning or semantics of the attributes of a relation.

– Before I give a formal definition, lets look at dependency informally first.

– Figure out which attributes are dependent on other attributes.

Formal Definition of Functional Dependency

• Given a relation R and attributes X and Y,

• we say Y is functionally dependent on X if and only if

– for each X value there is precisely one Y value associated with it.

– i.e. X functionally determines Y

• this is written like this: X -> Y

• What are the functional Dependencies of the EMP_DEPT and the EMP_PROJ relations

The FD of EMP_DEPT and the EMP_PROJ relations

SSN -> {ename, bdate, address, dnumber}

dnumber -> {dname, dmgrssn}

{SSN, pnumber} ->{hours}

SSN -> {ename}

pnumber -> {pname, plocation}

Full functional dependency

• If the X is composite, then there is a possibility that Y is not fully functionally dependent on X.

• To be fully functionally dependent, Y must not be dependent on any proper subset of X.

• {SSN,PNUMBER} -> ENAME is not fully functionally dependent because ENAME is dependent on SSN, a subset of {SSN, PNUMBER}.

The set of FD on a relation schema

• Let the letter F denotes the set of FDs on a relation

• Example for the EMP_DEPT relation

F = {SSN -> {ename, bdate, address, dnumber

dnumber -> {dname, dmgrssn}}

• These are the FD that are semantically obvious

• Are they the only FD?

– There are many other FD that can be inferred or deduced from the semantic obvious ones

• Example

– SSN-> {dname, dmgrssn} or even
SSN -> SSN, or dnumber -> dname

Inferred FDs

• Suppose F is the set of functional dependencies that are specified on relational schema R

– Typically, the schema designer specifies the FD that are semantically obvious

– But there are other depenencies that can be inferred or deduced from the FDs in F

• The closure F+ of F is the set of all functional dependencies that can be inferred from F

Inference rules of functional dependencies

• The inference rules are a set of rules that can be used to infer all the possible FDs from F

• Some of these FD are trivial

– A FD is trivial if and only if the right-hand side (the dependent) is a subset of the left -hand side (the determinant).

• One set of FD can be equivalent to another set

What are the functional dependencies

• Semantics

– An employee works for a single department, but may have multiple skills

– Skills have an ID, and a name; an employee takes a test at a certain date to establish a skill level

• Definition: Determinate

– any attribute on which some other attribute is dependent

Do you agree with these FD?

• empIDà {empName, empPhone, deptName}

– What about empPhoneà {empID}

• deptNameà{deptPhone, deptMgr}

• skillIDà{skillName}

– What about skillNameà {skillID} ?

• empID, skillIDà{skillDate, skillLevel}

• If you had a different set of FD, were they equivalent?

How functional dependencies are used

• FD are used as information about the semantics of the relational schema

• This, in turn, is used to determine the normal form of a relation

• Normal forms are used to analyze how “good” a design is

• To determine the normal form three things are needed

– 1) a set of FDs for each relation

– 2) a designated primary key

– 3) tests or conditions for each normal form

• So, next we look at the test or conditions for the normal forms

Normalization overview

• Originally, E.F. Codd proposed three normal forms, called first, second, and third normal forms

– 1NF, 2NF and 3NF are based on the functional dependencies among the attributes of a relation

– Later a stronger definition of 3NF was proposed by Boyce and Codd and is known as Boyce-Codd normal form.

– Later, 4^th and 5^th normal forms were proposed, based on other properties

• In the normalization process, you start with a universal relation (one table with all the attributes) and functional dependencies

• Then you apply the normal form restrictions, and decompose the tables as they specify.

Normalization of data

• This is a process of analyzing the relation schemas based on functional dependencies (FD) and PK to achieve two desirable properties

– 1) Minimizing data redundancy

– 2) Minimizing insertion, deletion and update anomalies

Using normal forms

• When a test fails, the relation violating that test must be decomposed

• All these normal forms were based on FD among the attributes of a relation

• Disclaimers to use of normal forms:

– Normal forms when considered by themselves do not guarantee good db design.

– Sometimes the best design may not be the highest normal form, for performance reasons.

Review of terminology of keys

• SUPERKEY - any set of attributes that make a tuple unique

• KEY - a minimum superkey

• CANDIDATE KEY - if a relation has more than one key, each is a candidate key

• PRIMARY KEY - an arbitrary candidate key

• SECONDARY KEY - a candidate key that is not the PK

• PRIME ATTRIBUTE - a member of any key

• NONPRIME or NONKEY ATTRIBUTE - not a member of any candidate key

First Normal Form

• 1NF is now generally considered to be part of the formal definition of a relation. It states that:

– the domains of attributes must include only atomic values

– the value of any attribute in a tuple must be a single value from the domain of that attribute

– all attributes are dependent on the PK

• Look at this schema for departments:

Normalizing the relation into 1NF

• The problem with this table is that department location is multivalued.

• What are the possible solutions?

– Have the primary key be composite {dnumber, dlocation}

• Problems with this?

– If you know the possible number of locations, make attributes for three locations

• Problem with this?

– Make a new table with dname and dlocation

• This is the preferred solution

Second Normal Form

• The second normal form is based on the concept of a full functional dependency.

– If a dependency is not a full functional dependency, it is called a partial dependency

– This can occur only if the determinate is composite

• A functional dependency X -> Y is a full functional dependency if removal of any attribute from X means that the dependency does not hold any more

• A functional dependency is a partial dependency if there is some attribute A in X that can be removed from X and the dependency still holds.

Definition for 2NF

• A relation is in 2NF if:
every nonkey attribute is fully functionally dependent on the primary key.

• If a relation is not in 2NF, it can be normalized into a number of 2NF relations

– This involves decomposing the table

• Problem: Normalize the EMP_PROJ table

• Procedure:

– Figure out the dependencies, noting the determinates

• A determinate is any attribute(s) on which some other attribute(s) are dependent

– Put each determinate in a table by itself, and

– Include in each table the attributes that are dependent on that determinate

EMP_PROJ

Third Normal Form

• Look at the ENP_DEPT relation

• Is it in 1NF? In 2NF?

• Is there still a problem? What is the problem?

EMP_DEPT

The problem with the EMP_DEPT table

• DNAME depends on DNUM, not the primary key.

• This is called a transitive dependency.

• A functional dependency X-> Y is a transitive dependency if

– there is a set of attributes Z of the relation and both X -> Z and Z ->Y hold

Definition for 3NF

• A relation is in 3NF if every nonkey attribute is:

– Fully functionally dependent on the primary key (i.e. in 2NF).

– Nontransitively dependent on the primary key.

• We can normalize EMP_DEPT by decomposing it into two 3NF relations.

• Intuitively, we see that the two results represent independent entity facts.

• A natural join will recover the original relation.

Boyce-Codd Normal Form

Identify FD of this relation

• Semantics

– Lot#’s are unique only within each county

– Property_ID#’s are unique across counties for the entire state

– The tax rate is fixed for a given county(does not vary lot by lot)

– The price of a lot is determined by its area regardless of which county it is in (assume that this is the price for the lot for tax purposes

• What is the PK? Normalize it into 3NF.

Generalizing the definition of 2NF and 3NF

– To assure we do not have update anomalies we need to extend this definition to include all candidate key rather than just PK

• For 2NF we disallowed partial dependencies on any key

• For 3NF we disallowed transitive dependencies on any key

• Simpler definition for 3NF: Every nonprime attribute must be

– fully functionally dependent on every key

– nontransitively dependent on every key

Boyce-Codd Normal Form

• This is a simpler, yet stricter form of 3NF.

• There are a few cases where a DB is in 3NF and not in BCNF.

• Example: define another FD on the Lots DB

– All the lots in the DB come from only two counties

– In one county, all lots are >= one acre

– In the other county, all lots are < one acre

– The FD is area -> county_name

– Is this still in 3NF

Definition of BDNF

• A relation is in BCNF if and only if every determinate is a candidate key

• To be in 3NF but not in BCNF, must have a composite candidate key, one of whose members is determined by a non prime attribute

• Any relation not in BCNF can be decomposed into an equivalent set of BCNF relations

Summary

• In general, it is best to have a relational schema in BCNF

• If that is not possible, 3NF will do

• 2NF and 1NF are not considered good relation schema designs.

– They allow too much data redundancy which leads to update anomalies

Summary of Normal forms

• A relation is in 1NF the domains of attributes must include only atomic values

• A relation is in 2NF if every nonkey attribute is fully functionally dependent on a candidate key.

• A relation is in 3NF if every nonkey attribute is in 2NF and nontransitively dependent on a candidate key

• A relation is in BCNF if and only if every determinate is a candidate key