Query Optimization

Chapter 15

From SQL to machine language

• In a high level language like SQL, a query must be

– scanned – identifies the tokens

– parsed – checks the query syntax

– validated – checks the table and attribute names are valid

• Then an internal representation of the query is created

– The query must be translated into relational algebra

– A query tree is built with the relational operators

• The DBMS then must devise an execution strategy for retrieving the result of the query

– To do this it optimizes the query tree

Steps in processing a SQL query

• Scanning, parsing and validating

• Query optimizer module

– Translate the SQL into relational algebra

– Build a query tree, then optimize it

– Produce an execution plan from a query tree

• Code generator

– Generate the code to execute the plan

– Use algorithms available in the DBMS for each relational algebra operator

• Runtime database processor

– Run the query code to produce the result

• May be either compiled or interpreted

The query optimizer

• Disclaimer: it cannot really produce the optimal strategy, just a reasonably efficient strategy

• First, translate SQL into relational algebra

– Nested queries are decomposed into query blocks

• Then the select, from , where, group by and having clauses are translated into their relational algebra equivalent

• A query tree is then built

• Then heuristic rules are applied to “optimize” the tree

• After you have an efficient tree, algorithms to implement the relational algebra operations are applied

Example of an algorithm to implement a relational operator

• There must be algorithms to implement every relational algebra operation

• Methods for implementing the JOIN operation

– Nested-loop join (brute force)

• Retrieve every record from both relations

• Test whether the two records satisfy the join condition

– Single-loop join (one relation has an index on the attribute in the join condition)

• Retrieve each record in the unindexed relation

• Use the index to retrieve only those records in the second relation that match

• Sort-merge join (both relations can be searched for the join attribute)

• If the files are not sorted on the join attribute, they are sorted

• Both relations are scanned concurrently in order of the join attribute

• If the join attributes match, the records are joined

• Hash-join ( the records of the two relations are both hashed to the same hash file using the same hashing function on the join attributes)

Heuristic rules

• One of the main rules is to apply SELECT and PROJECT operations before applying JOIN.

– SELECT and PROJECT operations reduce the size of a file

– JOIN usually multiplies the size of the file

• The further down the tree an operations appears, the earlier it is applied

Query trees

• A query tree is a tree that corresponds to a relational algebra expression

• The input relations are the leaf nodes of the tree

• The relational algebra operations are the internal nodes

• The query parser will generate a standard initial query tree (without optimization)

• How the initial tree is built:

– Apply Cartesian cross for relations in the FROM clause

– Then the conditions (join & selections) from WHERE clause are added

– Then projections from the SELECT clause

Example query

• Find the last names of employees born after 1957 who work on a project names ‘Aquarius’.

• SQL query

   SELECT lname
FROM employee, works_on, project
WHERE pname = ‘Aquarius’
          AND pnumber = pno
          AND essn = ssn
          AND bdate > ‘1957-12-31’;

Optimization of Query trees

• Making the initial tree more efficient

– Many different relational algebra expressions correspond to the same query

– The trick is to take the initial tree and convert it into an equivalent, more efficient tree

– There are rules to help do this

– Basically, we want to move select and project as far down the tree as possible and still maintain equivalence

• This is because they limit the size of the relation

The evolving query tree

• 1) First the initial query tree with Cartesian crosses

• 2) Then move the select operations down the tree

• 3) Then apply the more restrictive operations first

– pname is a key attribute, so will retrieve only one tuple

• 4) Replace Cartesian product and select with join operations

• 5) Move project operations down the query tree

Transformation rules for relational algebra operations

• See overhead or page 519 of text

Practice optimizing a query

• Draw Draw the initial and optimized query trees for the following queries

– Select fname, lname address
from employee, department
where dname = ‘research and dnumber = dno;

– Select E.lname S.lname
from employee as E, employee as S
Where E.superssn = S.ssn
and S.sex = ‘F’;

Converting the query tree into a query execution plan

• The execution plan must include information about the access methods that exist for each relation

– Each relation is stored ordered on one field; there may be other indexes into the relation

• It must also know the algorithms that exist for the relational operators represented in the tree

– There must be a different algorithm for each access into the relation

• With this information, it can choose the algorithm to implement the operation with the given data

Query code generator

• The query tree gives the order of execution of each relational algebra operation; the execution plan involves choosing the correct algorithm to implement the operation

• Temporary results are computed, then used as input for the next operation in the query tree

• The RDBMS must include algorithms to implement the different types of relational operations

• There are often several algorithms to implement the same operation, since one may apply only to a particular storage structure and access path

Typical algorithms

• Any query that specifies order by must use an external sort algorithm

• Implementing the select operation

– These can include linear search, binary search, using a primary key or hash table, using a clustering index or a secondary key (usually a B+ tree)

• Implementing a join operation

– These can include nested-loop join, single-loop join, sort-merge join, hash-join

• Implementing project operations

– This just involves eliminating some of the columns in a table